We’ve written a few times about using LLMs to write/enhance prompts (Using LLMs to Optimize Your Prompts, How to Optimize Long Prompts, RecPrompt). Most of these methods use actions like adding, removing, or replacing parts of a prompt, trying the new versions, and iterating to find the best one.

Generally, these methods use a “frozen” (static, unmodified) LLM to generate and evaluate the new prompts. But what if you trained the LLM doing the rewriting? Enter PRewrite! PRewrite comes from a recent paper from Google: PRewrite: Prompt Rewriting with Reinforcement Learning.

PRewrite’s automated prompt engineering framework differentiates itself by using reinforcement learning to fine-tune the LLM that rewrites prompts.

How so? Let's dive in and see if this new framework will replace prompt engineering.

What is PRewrite?

PRewrite is an automated framework for optimizing prompts.

The biggest difference between PRewrite and other automated prompt optimization frameworks is the use of a reinforcement learning loop. This loop enables the Prompt Rewriter to continually improve, using a reward computed by comparing the generated output against the ground-truth output.

Simply put, the Prompt Rewriter gets fine-tuned based on how well its previously rewritten prompts performed.

PRewrite flow with reinforcement learning

PRewrite components

As depicted above, PRewrite has a few components:

Prompts

Here are the prompts used in the PRewrite process (a small sketch showing how they fit together follows the list).

Initial Prompt: The original hand-written prompt. This is the starting point.

Meta Prompt: The instruction given to the Prompt Rewriter, telling it how to rewrite the initial prompt.

Rewritten Prompt: The final rewritten prompt generated by the Prompt Rewriter.
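To make those three roles concrete, here is a minimal Python sketch. The meta prompt and rewritten prompt below are illustrative paraphrases, not the exact strings from the paper.

```python
# Illustrative only: the wording of the meta prompt and the rewritten prompt
# is a paraphrase, not taken from the PRewrite paper.
initial_prompt = "Answer the question."

meta_prompt = (
    "Rewrite the following instruction so that a language model "
    "answers more accurately: {initial_prompt}"
)

# Input to the (trainable) Prompt Rewriter:
rewriter_input = meta_prompt.format(initial_prompt=initial_prompt)

# Something the Prompt Rewriter might return:
rewritten_prompt = "Answer the question with a short, specific phrase."
```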

Video transcript

Hey, how's it going? Dan here, co-founder of PromptHub, and today we're going to talk about how you can use an automated system to let LLMs optimize your prompt using reinforcement learning. I promise it will be a lot simpler than it sounds. We've written a lot about using LLMs to optimize your prompts, and you can catch most of those articles on our blog. We haven't been blown away by any of them for the most part because we've found that the best process is using a human plus an LLM in some capacity to help optimize your prompts. We still believe that today, even after reading through the paper that we'll be discussing.

There have been a ton of papers around this, and a lot of times they just don't present evidence that is super overwhelming. I haven't seen a lot of them actually used in practice, and when we've tested them, we found that human intervention using an LLM as a place to draw inspiration has been the best process. A lot of the times, the reason for this is that the frameworks or methods used in many of these papers use a kind of frozen LLM to optimize the prompts, meaning they're just using the base out-of-the-box LLM as an optimizer. There's not a lot of fine-tuning. Usually, the innovation happens on a different level, other than the LLM, maybe in the prompt or in the kind of framework setup. But the LLM that ends up actually optimizing the prompt generally stays frozen.

That is the major difference between what we're going to talk about today and the other methods. Today we're talking about PRewrite, a prompt rewriting optimization method that uses reinforcement learning. Its biggest differentiator is that the LLM used to rewrite the prompts continually gets fine-tuned based on a reward, so the rewriter keeps improving as the outputs improve.

Typical flow here: the initial prompt comes in, it gets fed to a prompt rewriter, and the rewritten prompt goes to a frozen LLM to execute the task. The output comes out and is compared against some ground truth based on some metric, which we'll look at. If it does well or not, that is noted, and the reward is sent back to the prompt rewriter to improve the process. In the case that the task output is good, that will be signaled to the prompt rewriter, and if it's bad, that will be signaled as well, and changes will be made.

There are a couple of major components: the policy function and the rewards. Let's start with an example. Here's an initial prompt: "Answer the question." It gets fed to the meta prompt rewriter, which says, "Rewrite the prompt, add specific requirements," etc. This could be the initial rewritten prompt, which is then fed to the frozen LLM for task generation.

More under the hood: the big component is the policy function, which is just an algorithm that guides the rewriter model to make decisions that will enhance the prompt based on whatever reward it's trying to optimize. It's basically looking at all the actions it could take—adding, removing, or altering tokens—and the policy function helps decide which action has the highest probability of generating more rewards or better outputs based on historical runs.

An everyday example of a policy function: crossing the street. You evaluate the current state, check the lights, cars, distances, and potential actions (wait, start walking, speed up, slow down). The reward here is to maximize safety. Over time, as you cross more streets, you learn which actions lead to maximum safety, refining your policy function. Similarly, PRewrite's model evaluates which actions (token changes) to take based on historical data to maximize the reward, which is determined by the effectiveness of the final rewritten prompt.

Talking about rewards: this paper looked at five different ones.

  • Exact Match (EM): Used for classifying data where the answer should be exact (e.g., X, Y, or Z).
  • F1 Score: Combines precision and recall, measuring how many of the model's answers are correct and how many of the actually correct answers the model finds.
  • Perplexity: Measures the model's predictive certainty. Lower values indicate the model was less surprised by the token sequence. For example, "The dog ate a bone" would have low perplexity, while "The dog ate a raccoon with clams" would have high perplexity.
  • Perplexity and F1: A combination of the two.
  • Length Difference: Compares the output length to the ground truth.

Here's how it comes together: you start with an initial prompt that gets rewritten by the meta prompt rewriter. Multiple prompt variants are generated and fed into the LLM to produce outputs, which are then judged against the ground truth. The rewards are analyzed, and the prompt rewriter is continually trained based on those rewards.

This is a Google paper, so the model used was a proprietary model. They tested PRewrite across a few datasets with quantitative outputs (exact match). They compared PRewrite to other automated methods. The major results show PRewrite outperforming the original prompt on AG News and Natural Questions but underperforming on the SST-2 dataset, which is basic sentiment analysis (classifying movie reviews as positive or negative). Since it's a simple use case, there's not much room for prompt improvement. All automated methods failed to beat the original prompt on this simple dataset, highlighting that you don't need complex prompts for simple use cases.

PRewrite outperformed the other automated methods on the more complex datasets. Here's a quick example of different prompts optimized for different rewards:

  • Initial Prompt: "Answer the question."
  • Length Difference: Emphasis on not exceeding 100 characters.
  • Exact Match: Emphasis on composing a short answer (e.g., "Who is the president of the United States?" should be "Joe Biden").

This highlights how prompts change based on different reward evaluations, something you can apply to your own prompt engineering.

Now, for a prompt quiz: look at these two prompts and guess which one performs better. If you guessed B, you were correct. The longer, more intricate prompt performed better, showing how subtle differences can lead to big outcomes.

They also broke down results by reward type. On SST-2, the original prompt performed best, while on the other datasets, PRewrite's optimizations shone. Perplexity and F1 combined had interesting differences worth exploring for your use cases.

If you have any questions or comments, let us know. We're always here to cover more topics in the future.

Reinforcement learning components

The reinforcement loop consists of two major components, a policy function and rewards.

Policy function

The policy function is an algorithm that guides the Prompt Rewriter model to make decisions that will enhance the prompt, based on a certain reward. It is a probability distribution over a set of potential actions given the current state.

For example, let’s say you want to cross the street at a busy intersection. This is how your internal policy function would run:

  • Evaluate the current state (traffic lights, cars, distance) and potential actions (wait, start walking, speed up, slow down) to maximize safety (the reward in this case).
  • Through experience you’ll learn which actions lead to maximizing safety (the reward) in different traffic conditions, continually optimizing your policy function over time.

Back to PRewrite.

The actions to consider are which tokens to add, delete, or modify, based on the current state, which is the Initial Prompt. The policy function is there to guide the prompt rewriter model in making decisions that are expected to maximize the reward, which is determined by the effectiveness of the rewritten prompt in generating accurate outputs.
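To make the idea of a “probability distribution over actions given the current state” concrete, here is a toy Python sketch, assuming a made-up action set and a random scoring stand-in. In PRewrite the policy is the rewriter LLM itself, updated through the RL rewards, so this is only an illustration of the concept.

```python
import math
import random

# Toy policy function: a probability distribution over candidate rewrite
# actions, given the current state (the prompt so far). The action set and
# scoring function are hypothetical stand-ins, not from the paper.
ACTIONS = ["add a requirement", "add an example", "rephrase", "remove filler"]

def action_score(state: str, action: str) -> float:
    # A trained policy would score actions with the model; here it's random.
    return random.random()

def policy(state: str) -> dict[str, float]:
    # Softmax over action scores -> probability of taking each action.
    scores = {a: action_score(state, a) for a in ACTIONS}
    z = sum(math.exp(s) for s in scores.values())
    return {a: math.exp(s) / z for a, s in scores.items()}

probs = policy("Answer the question.")
chosen = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(probs, "->", chosen)
```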

Rewards

Rewards are used to inform the policy function and Prompt Rewriter about how effective the newly rewritten prompt was. The researchers explored a few different reward functions (a small sketch of how these could be computed follows the list):

  • Exact Match (EM): Checks if the output exactly matches the ground-truth output
  • F1: Combines precision (correct predictions divided by the total number of predictions made) and recall (correct predictions divided by the total number of actual positives) into one metric. For example, with 80% precision (80 true positives out of 100 predictions) and roughly 89% recall (80 true positives out of 90 actual positives), the F1 score is the harmonic mean of the two, about 84%.
  • Perplexity: Measures the model's prediction certainty. Lower values indicate the model is less surprised by the sequence of tokens. Low perplexity: “The dog ate a bone”. High perplexity: “The dog ate rigatoni ragù”. Lower perplexity is rewarded.
  • Perplexity + F1: Combines perplexity (the unexpectedness of the output) with F1 (evaluating accuracy and completeness), rewarding outputs that are both predictable and precise.
  • Length difference: Rewards based on the length difference between the output and ground-truth output.
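Here is a rough sketch of how these rewards could be computed. The exact formulations in the paper may differ; in particular, the perplexity function below assumes you have per-token log-probabilities from the task LLM.

```python
import math
from collections import Counter

def exact_match(pred: str, truth: str) -> float:
    # 1.0 if the output matches the ground truth exactly, else 0.0.
    return float(pred.strip().lower() == truth.strip().lower())

def f1(pred: str, truth: str) -> float:
    # Token-overlap F1: harmonic mean of precision and recall.
    pred_tokens, truth_tokens = pred.lower().split(), truth.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

def perplexity_reward(token_logprobs: list[float]) -> float:
    # Perplexity = exp(-mean log-prob); negate it so lower perplexity
    # yields a higher reward.
    return -math.exp(-sum(token_logprobs) / len(token_logprobs))

def length_difference_reward(pred: str, truth: str) -> float:
    # Penalize outputs whose length strays from the ground truth's length.
    return -abs(len(pred) - len(truth))
```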

PRewrite flow

Bringing it all together, here is what the flow looks like (a minimal end-to-end sketch follows the list):

  1. Start with an initial prompt, p0
  2. p0 is rewritten by the Prompt Rewriter, using the meta-prompt, to get a set of prompt variants
  3. All of the variants are fed into an LLM to generate outputs
  4. The Prompt Rewriter continually gets trained using reinforcement learning based on rewards determined by the effectiveness of the generated output against the ground-truth.
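Below is a minimal, self-contained sketch of that loop, with toy stand-ins for the rewriter, the frozen task LLM, and the RL update. None of the helper names come from the paper, which fine-tunes PaLM 2-S as the Prompt Rewriter.

```python
import random

META_PROMPT = "Rewrite the instruction below so a frozen LLM answers more accurately:\n"

def rewriter_llm(meta_plus_prompt: str, n: int = 3) -> list[str]:
    """Toy trainable Prompt Rewriter: returns n candidate rewritten prompts."""
    suffixes = ["", " Answer concisely.", " Give a short, specific answer."]
    base = meta_plus_prompt.splitlines()[-1]  # the initial prompt, p0
    return [base + s for s in random.sample(suffixes, n)]

def frozen_task_llm(prompt: str, question: str) -> str:
    """Toy frozen task model; a real implementation would call the LLM."""
    return "Joe Biden"

def exact_match(pred: str, truth: str) -> float:
    return float(pred.strip().lower() == truth.strip().lower())

def rl_update(prompt: str, reward: float) -> None:
    """Stand-in for the RL update applied to the rewriter's weights."""
    print(f"reward={reward:.1f} for prompt: {prompt!r}")

initial_prompt = "Answer the question."                            # step 1: p0
question, truth = "Who is the president of the United States?", "Joe Biden"

for prompt in rewriter_llm(META_PROMPT + initial_prompt):          # step 2: variants
    output = frozen_task_llm(prompt, question)                     # step 3: frozen LLM runs
    rl_update(prompt, exact_match(output, truth))                  # step 4: reward -> rewriter
```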

Experiment setup

Datasets: Natural Questions (NQ), SST-2, and AG’s News

Models used: PaLM 2-S for both the Prompt Rewriter and the frozen model that runs the rewritten prompt

Experiment results

Let’s dive right in.

Table of results from the experiments showing different methods and their accuracies on various datasets
Task accuracy percentages across the datasets

Takeaways

  • PRewrite outperforms the original prompt for NQ and AG’s News, but not for SST-2. This is most likely because the tasks in SST-2 are extremely simple and leave little room for improvement over the initial prompt
  • As a point of reference, the SST-2 dataset focuses on sentiment analysis derived from movie reviews, e.g., “contains no wit, only labored gags.”
  • ALL of the automated methods fail to beat the original prompt on the SST-2 dataset. This goes to show that you can over-engineer prompts.
  • PRewrite outperforms all the other automated methods

Examples

A table with different prompts and reward combos and their corresponding accuracies
Rewritten prompts with various rewards for the “Natural Questions” dataset

Here are a few examples of how different rewards affect the final rewritten prompt. Again, it isn’t always the longest or most detailed prompt that wins out. In this case, optimizing for Perplexity+F1 leads to the highest accuracy.

There are big changes (~10%) in accuracy depending on the reward mechanism here.

A table with different prompts and reward combos and their corresponding accuracies
Rewritten prompts with various rewards for “AG’s News” dataset

Now the longest and most detailed prompt performs the best, by a few tenths of a point.

What’s most interesting is the difference between the initial prompt and the rewritten prompt with Perplexity + F1 as the reward. They're so similar, yet the performance gap is huge (10%)! Another example showing how subtle changes can make a huge impact in prompt engineering.

Results broken down by Reward type

Table comparing the different accuracy levels based on reward mechanism across multiple datasets
Task accuracy % on eval datasets, broken down by different rewards

The data in this table comes directly from the paper; I’ve added the performance of the original prompt as a point of reference.

Reward                   NQ     SST-2   AG’s News
None (Original Prompt)   24.1   96.7    76.9
EM                       29.3   95.5    84.5
F1                       30.6   95.5    84.5
Perplexity               26.5   95.8    60.1
Perplexity + F1          32.3   96.4    84.2
Length difference        29.5   N/A     N/A

  • Averaged across the three datasets, “Perplexity + F1” is the best performing reward
  • “Perplexity” performs significantly worse compared to “Perplexity + F1” and is even outperformed by the original prompt in the AG’s News dataset.
  • Rewarding for perplexity encourages predictable responses from the model, while incorporating the F1 score also rewards accuracy. Together they address both the fluency and the relevance of the content.

Wrapping up

There are a lot of questions around using LLMs to write prompts. Our opinion is that you get the best outputs from combining human effort with LLMs. This paper provides a new way to think about automating some of this work and is definitely worth exploring!

Dan Cleary
Founder