Even before AI and Large Language Models (LLMs) became cool, recommendation systems were one of the earliest use cases for AI.

The rise of LLMs has made building a recommendation system 100 times easier. What used to take months can now be done in a few days with some prompt engineering.

What makes things even easier is the framework laid out in this paper: RecPrompt: A Prompt Tuning Framework for News Recommendation Using Large Language Models.

Even if you aren’t building a recommendation system on top of LLMs, the principles in the RecPrompt framework can be applied to many AI use-cases.

What is RecPrompt

RecPrompt is a prompt engineering framework designed to enhance news article recommendations.

RecPrompt has three main components: A Prompt Optimizer, a Recommender, and a Monitor.

The flow of information for the RecPrompt framework

The Recommender generates news recommendations, which, along with the initial prompt, are fed into the Prompt Optimizer.

The Optimizer refines the prompt based on the example recommendations provided, enhancing alignment with user preferences based on previous recommendations. This creates a feedback loop that leads to responses that are more in line with the user’s click history.

The Monitor measures and records the effectiveness of the newly generated prompt against specific performance metrics like Mean Reciprocal Rank (MRR) and others.
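To make the loop concrete, here is a minimal Python sketch of how the three components could fit together. The helper names (call_llm, score_mrr, build_optimizer_input) and the data shapes are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the RecPrompt feedback loop (helper names and data
# shapes are illustrative assumptions, not the paper's implementation).

def call_llm(prompt: str) -> str:
    """Stand-in for whatever chat-completion client you use."""
    raise NotImplementedError

def recprompt_loop(initial_template, users, candidate_news,
                   score_mrr, build_optimizer_input, n_rounds=3):
    template = initial_template
    best_template, best_score = template, float("-inf")

    for _ in range(n_rounds):
        # 1. Recommender: fill the template for each user and collect rankings
        samples = []
        for user in users:
            prompt = template.format(history=user["history"],
                                     candidates=candidate_news)
            samples.append({"user": user, "ranking": call_llm(prompt)})

        # 2. Monitor: score the recommendations (e.g. MRR) and keep the best prompt
        score = score_mrr(samples)
        if score > best_score:
            best_template, best_score = template, score

        # 3. Prompt Optimizer: ask another LLM to refine the template,
        #    using the samples it just produced as feedback
        template = call_llm(build_optimizer_input(template, samples))

    return best_template
```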

Below are some of the exact prompts used.

System Message Prompt

Initial Prompt Template

We also put together a template you can use directly in PromptHub. It is based on the template above, with some structural enhancements.

RecPrompt Template in the PromptHub web application

Hey guys, how's it going? Dan here, co-founder of PromptHub, here to talk today a little bit about recommendation systems and how you can use prompt engineering to build something really quickly. The basis for what we'll be looking at today is a framework called RecPrompt from a recent research paper. It is essentially a prompt engineering framework using multiple LLMs to create news recommendations, but the recommendations could be applied to any type of entity, whether that's restaurants, music, books, etc. Shout out to the research team that put this together; it’s an insightful piece of work, and we'll jump right in.

There are a few components to this prompt engineering framework: a prompt optimizer, a recommender, and a monitor. The optimizer optimizes the prompt using LLMs, or in this case, they also tried manually optimizing. The recommender is the component that actually makes the recommendations, and the monitor keeps track of all the different recommendations made and evaluates them against certain metrics to see how they perform quantitatively.

Here is the general flow; it might look a little complicated, but we can break it down pretty simply. We have the optimizer, the recommender, and the monitor. The recommender is the one that generates the news recommendations. That, along with an initial prompt template, goes into the prompt optimizer. You start with some initial prompt template—even on the first run, it's just something basic that says, "Based on the user's news history," and so on. That, plus some of the recommendations, goes into the prompt optimizer. The prompt optimizer has a system instruction that is static or frozen, and that, with the template plus the examples, are sent to the optimizer along with a meta prompt. This is the prompt that tells it, "Hey, based on the observations, enhance the prompt." All these different things get packaged up and sent to the prompt optimizer.

On the other side, the prompt optimizer outputs a prompt including some of these samples and any updates the LLM made. That becomes what gets sent to the monitor and the user, and everything is tracked through the monitor. The monitor measures and records the effectiveness of these newly generated recommendations across a couple of different sets of evaluation metrics, which are represented by the gray letters here. It's not super important right now; we'll go further into the evaluation set a little bit more. The refined prompt goes to the recommender, and everything is tracked through the monitor. So it's a little complicated, but also not too complicated.

Here’s a closer look at the system instruction message from the last graphic, put up a little closer here so you can get a better idea. It's pretty straightforward; nothing too crazy to write home about. Here's that initial prompt—the first initial prompt used that would be fed into the prompt optimizer. You see we're inputting a lot of variables here. They break it up using headings and markdown. They take input, there's history, and here's the candidate news (the potential news to recommend). There are slots for all these variables to get filled in. We turned this into a template that you can use directly in PromptHub, so you can grab it, and we'll link it below as well.

We looked at the optimizer before, and in that case, it was an LLM-based optimization. We’re basically grabbing all these things—the recommendations, the initial prompt template, the system instruction—and feeding it all into an LLM to get an enhanced prompt on the other side. The researchers also tested manually updating the prompts that were eventually used to make the recommendations. They created a situation where they could compare two prompts: the ones being manually created and the ones being optimized by the LLM. We've written a lot about using LLMs to optimize LLMs, and you can read about it all on our blog. Our general position is that we don't think just using an LLM to make your prompt better gets you the best results. We find that it's through iteration and some human intervention, plus using LLMs for some part of it, that you get the best results.

They ran a bunch of experiments testing these different types of methods. They used a news dataset from Microsoft (MIND) and GPT-3.5 and GPT-4. They tested news recommendation methods (which are just kind of classic recommendation methods): random, most-pop, and topic-pop. Basically, random selects articles randomly, most-pop looks at the most popular based on the total number of aggregate views across the dataset, and topic-pop is more tailored to the specific user based on their browsing history. These are not LLM-based methods but still recommendation methods. They also tested a bunch of deep neural methods. I won't go too deep into all of them; these are just well-known deep neural models for making recommendation systems.

So we have the simple recommendations, the deep neural ones, and the LLM-based ones. Within the LLM ones, we have the handcrafted prompt versus the LLM-optimized prompt, which is that bottom one. What are we looking at here? We can see topic-pop is the top performer in the first group of methods, which makes sense because topic-pop focuses on the user's browsing history more so than the global dataset. All the deep neural models outperform the top method here, and coming down to the LLM-based ones, there are interesting takeaways.

The initial prompt with GPT-4 outperforms most of the neural models. The initial prompt is just running that prompt we looked at earlier, not doing any further optimization or anything too crazy. We can see just using a really strong model like GPT-4 outperforms most of these methods but not all. There's a clear pattern that an LLM-generated prompt tends to perform better than a handcrafted prompt, which tends to perform better than just an initial static prompt. They’re close, though, and that's important to keep in mind. We'll go deeper into the percentage differences between these different LLM prompt methods. We see a clear, big distinction between GPT-4 and 3.5, which is expected. The only LLM-based recommendation method that beats all the neural models is the last row here, having the LLM generate the prompt using GPT-4.

Now, for my favorite part of this paper, digging into the different trade-offs between the different LLM processes. If we hyperfocus on just GPT-3.5 and 4 and pick one dataset, this pattern is pretty similar across all of them. If we look at the initial prompt, which is just a static prompt, nothing crazy going on, the handcrafted prompt is using the whole framework but having a human update the prompt based on performance, and then LLM-generated is that automated process we looked at earlier. We can see from the initial prompt to the handcrafted prompt there's a 1% gain here, a 3% gain there, not crazy. You could argue that the time it would take to implement the framework may not be worth it for that relatively small gain, but it depends on your use case. Then jumping from the initial prompt to automating the whole process using an LLM-generated prompt, you get a 6%, 4%, or 5% gain on average. Again, this could be significant or insignificant, and setting up the framework might be straightforward or more challenging, depending on your situation.

When reading these papers, consider your situation, how much engineering power you have, how easy it is to set up these things, and understand that the underlying base models are very strong. Having just a good initial prompt can get you far. Here's what that initial prompt looks like—pretty straightforward. You're pulling in some information and not doing anything too crazy. It all comes down to whether that percentage difference makes a big difference. If you have something at scale, it probably does; if this is just a proof of concept, the initial prompt might be good enough.

You can try this in PromptHub today. We'll have a link to it below as well. Happy prompting, and let me know what you think. If you implement it, feel free to drop us a message or comment below. See you!

Prompt Engineering Techniques used in RecPrompt

A very cool aspect of the RecPrompt paper is their testing of two methods for the prompt optimizer: manually tuning the prompt and letting an LLM update the prompt.

We’ve talked a few times about using LLMs to optimize prompts (Using LLMs to Optimize Your Prompts, How to Optimize Long Prompts).

We believe LLMs can be great prompt optimizers, but we’ve seen the best results when combining humans and LLMs.

Manual prompt engineering method

RecPrompt starts with an initial prompt template. This template is then manually updated during the optimization process. This involves tweaking things like the instructions and descriptions to improve the LLM's understanding and recommendation performance.

LLM-based prompt engineering method

This method automates the prompt iterating process using another LLM.

The Prompt Optimizer refines the initial prompt template by integrating four components into a single input for the LLM:

  1. The system message
  2. The current candidate prompt template
  3. A set of samples from the recommender
  4. An observation instruction (meta-prompt) which guides the LLM to adjust the prompt template to better align with the desired improvements and recommendation objectives.

The output from the optimizer is the enhanced prompt template.

The refined prompt is submitted to the recommender LLM to make news recommendations.
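As a rough illustration, here is one way those four pieces could be packaged into a single optimizer input. The system message and meta-prompt wording below is paraphrased for the sketch, not copied from the paper, and the sample format is an assumption.

```python
# Hypothetical assembly of the optimizer input; the prompt wording is
# paraphrased and the sample format is an assumption.

def build_optimizer_input(current_template: str, samples: list) -> str:
    system_message = (
        "You are a prompt engineer improving a news recommendation prompt."
    )
    observation_instruction = (  # the meta-prompt
        "Based on the observations above, rewrite the prompt template so the "
        "recommendations better match each user's click history. "
        "Keep the {history} and {candidates} placeholders."
    )

    formatted_samples = "\n\n".join(
        f"User history: {s['user']['history']}\n"
        f"Recommendations: {s['ranking']}"
        for s in samples
    )

    return (
        f"{system_message}\n\n"                              # 1. system message
        f"Current prompt template:\n{current_template}\n\n"  # 2. candidate template
        f"Observations:\n{formatted_samples}\n\n"            # 3. recommender samples
        f"{observation_instruction}"                         # 4. meta-prompt
    )
```

The enhanced template returned by the optimizer LLM then replaces the current template for the next round of recommendations.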

Experiment set up

The researchers put RecPrompt to the test across various datasets to evaluate its effectiveness.

Datasets

  • Microsoft News Dataset (MIND): Collection of news articles

Evaluations

  • RecPrompt’s performance was evaluated across a few metrics:  AUC, MRR, nDCG@5, and nDCG@10
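For reference, here is what two of these metrics (MRR and nDCG@k) look like in code, assuming each ranking is a list of article ids ordered by the recommender and clicked is the set of articles the user actually clicked. This is a generic implementation, not the paper's evaluation code.

```python
import math

def reciprocal_rank(ranking, clicked):
    """Reciprocal rank of the first clicked article; average over users for MRR."""
    for rank, article in enumerate(ranking, start=1):
        if article in clicked:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranking, clicked, k):
    """Normalized Discounted Cumulative Gain at cutoff k (binary relevance)."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, article in enumerate(ranking[:k], start=1)
              if article in clicked)
    ideal_hits = min(len(clicked), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```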

Implementation Details

  • Models Used: GPT-3.5 and GPT-4

Baselines

RecPrompt was compared against a few news recommendation methods and deep neural models.

News recommendation methods

  1. Random: Randomly recommends candidate news to the user
  2. MostPop: Recommends the most popular news, based on the total number of views aggregated across the dataset
  3. TopicPop: Suggests popular news articles based on the user’s browsing history.
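As a quick sketch of what these three baselines boil down to (the views and topics lookups are hypothetical names for the dataset's view counts and article topics):

```python
import random

def recommend_random(candidates, k=5):
    # Random: pick k candidate articles with no signal at all
    return random.sample(candidates, min(k, len(candidates)))

def recommend_most_pop(candidates, views, k=5):
    # MostPop: rank candidates by total views across the whole dataset
    return sorted(candidates, key=lambda a: views.get(a, 0), reverse=True)[:k]

def recommend_topic_pop(candidates, views, topics, history_topics, k=5):
    # TopicPop: keep only candidates whose topic appears in the user's
    # browsing history, then rank those by popularity
    personalized = [a for a in candidates if topics.get(a) in history_topics]
    return sorted(personalized, key=lambda a: views.get(a, 0), reverse=True)[:k]
```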

Deep neural models

Below are the deep neural models tailored for news recommendations. It’s not very important to understand the intricacies, but here is a little info on each.

  • LSTUR: Integrates an attention-based Convolutional Neural Network (CNN) for learning news representations with a GRU network
  • DKN: Employs a Knowledge-aware CNN for news representation and a candidate-focused attention network for recommendations.
  • NAML: Uses dual CNNs to encode news titles and bodies, learning representations from text, category, and subcategory, with an attention network for user representations.
  • NPA: Features a personalized attention-based CNN for news representation, coupled with a personalized attention network for user modeling.
  • NRMS: Applies multi-head self-attention mechanisms for news representation and user modeling.

Experiment Results

Alright, let’s take a look at the results.

Table of results from the experiments

Takeaways

  • TopicPop is the top performer in the first group of methods, probably because it leverages users’ browsing history to provide more tailored recommendations
  • All the deep neural model methods outperform TopicPop
  • NAML, on average, outperforms all other methods in its group
  • Initial Prompt with GPT-3.5 performs worse than TopicPop and all the deep neural models
  • Initial Prompt with GPT-4 outperforms most deep neural models
  • There is a clear pattern when it comes to effectiveness: LLM-Generated Prompt > Hand-Crafted Prompt > Initial Prompt.
  • GPT-4 is a better prompt optimizer compared to GPT-3.5
  • The only LLM-based recommendation method that beats all of the deep neural models is the LLM-Generated Prompt using GPT-4

Here's my favorite part

Let’s dive deeper into the incremental improvements of the three prompt-based methods. We’ll focus on just one of the evaluation metrics (MRR, because I’m a SaaS founder), although the trends are similar across all of them.

GPT-3.5 performance on MRR

Initial Prompt: 36.04

Hand-Crafted Prompt: 36.39 (0.97% increase compared to Initial Prompt)

LLM-Generated Prompt: 38.27 (6.19% increase compared to Initial Prompt)

GPT-4 performance on MRR

Initial Prompt: 45.31

Hand-Crafted Prompt: 46.65 (2.96% increase compared to Initial Prompt)

LLM-Generated Prompt: 47.24 (4.26% increase compared to Initial Prompt)
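These percentages are just the relative improvement over the Initial Prompt’s MRR; a quick check:

```python
def relative_gain(baseline, improved):
    return (improved - baseline) / baseline * 100

print(round(relative_gain(36.04, 38.27), 2))  # GPT-3.5, LLM-Generated Prompt: 6.19
print(round(relative_gain(45.31, 47.24), 2))  # GPT-4, LLM-Generated Prompt: 4.26
```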

If you need to recall what the Initial Prompt is, I’ve copied it below for reference:

It’s just a single prompt, which makes it easier to implement than the other methods. Recreating the larger prompt engineering framework in full is neither extremely challenging nor trivial.

The percentage differences in performance can be seen as large or small in relation to the work needed to implement the framework. It really depends on your use case. If Google can make Google Search 0.00001% faster, that’s a big deal. But if your use case isn't as sensitive to smaller percentage gains, or if you just need to get something out quickly, I would advise using GPT-4 and leveraging the basic prompt template above.

Wrapping up

RecPrompt can teach us a few things about prompt engineering. The automated framework is a good use case of using LLMs to optimize prompts, in a way that is structured and has a recurring feedback loop.

Sometimes the most interesting takeaways from studies like these are pretty simple. The absolute easiest way to get better results is to use GPT-4. Depending on your use case, implementing a framework like RecPrompt can be the boost that makes your product more compelling than your competitors’, or it may not be necessary as a starting point. Either way, now you have the tools and knowledge to decide!

Dan Cleary
Founder