Prompt Engineering has evolved significantly in the past year. Slowly, best practices are being established. In-context learning (ICL) via few-shot prompting and chain-of-thought reasoning are two of the most prominent. While these patterns are effective, they result in longer and longer prompts, which means higher costs and latencies.

In a recent article, we discussed a method to optimize long prompts, but it focused on output quality rather than controlling length and latency. That's why a recent (December 2023) research paper from Microsoft caught our eye. In LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, the researchers looked at how long prompts can be compressed without losing out on performance.

In some cases, the researchers were able to compress a prompt by a factor of 20 while experiencing little to no degradation in performance! It’s almost unbelievable, and there are some nuances to account for, so let's dive in.

What is LLMLingua?

LLMLingua is a prompt compression framework/method. It takes a prompt as an input, goes through some steps, and outputs a compressed version of that prompt.
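
Before getting into the internals, here's roughly what using it looks like in code. Microsoft open-sourced an llmlingua Python package; the snippet below is a sketch of the interface available at the time of writing, so parameter names and defaults may differ from the current release, and the example texts are placeholders.

```python
# A rough usage sketch of the open-source llmlingua package (pip install llmlingua).
# The interface shown here reflects the version available at the time of writing;
# check the repo for the current API. The example texts are placeholders.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # loads a small open model used to score tokens

instruction = "Answer the math word problem. Show your reasoning."
demonstrations = [
    "Q: Tom has 3 apples and buys 2 more. How many apples does he have? A: 3 + 2 = 5. The answer is 5.",
    "Q: A book costs $12 and a pen costs $3. What is the total cost? A: 12 + 3 = 15. The answer is 15.",
]
question = "Q: Sara reads 14 pages a day for 6 days. How many pages does she read in total?"

result = compressor.compress_prompt(
    demonstrations,            # the part that gets compressed the most
    instruction=instruction,   # kept largely intact
    question=question,         # kept largely intact
    target_token=200,          # rough token budget for the compressed prompt
)

print(result["compressed_prompt"])  # send this to GPT-4 / Claude instead of the original
```

The compressed prompt then gets sent to the big model exactly like a normal prompt would.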

It consists of three major components:

  • The budget controller: Controls how much each part of the prompt gets compressed. For example, the few-shot examples should be compressed more than the instructions. It operates at the sentence or demonstration level.
  • Token-level prompt compression algorithm: Divides the prompt into segments and compresses them iteratively at the token level until a threshold is reached.
  • Instruction tuning method to align the LLMs used in the process: LLMLingua uses a small model to estimate token probabilities during compression. Instruction tuning better aligns this model with the black-box LLM that will be used for the final generation (OpenAI or Anthropic’s models). This improves the accuracy and efficiency of the compressed prompts.

A graphic demoing the flow of how LLMLingua works with its several components

How LLMLingua works

Let’s dive into each of the three major components of LLMLingua.

1. Budget Controller

Before any compression, the prompt is separated into its different components: the instructions, the demonstrations, and the question. The budget controller dynamically allocates different compression ratios to each component.

Generally, the instruction and question in a prompt are more critical and influential, while there is often some level of redundancy in the demonstrations. Thus, more budget (less compression) is allocated to the instructions and question, and more compression is applied to the demonstrations. In practice, the demonstrations are compressed first, and any remaining budget is allocated to the other components of the prompt.

Let’s see how this actually works in practice:

Image laying out the many steps in the algorithm that the budget controller runs
Budget controller algorithm

Okay, let's run through this:

  1. Start with a set of demonstrations from the original prompt
  2. Set the target compression rate (this is configurable by the user)
  3. Use a small LLM to compute the perplexity (more on this below) of each demonstration
  4. Keep the lower-perplexity demos (up to a threshold) and compress them (see the code sketch below)
  5. Once the loop completes, the remaining compression budget is allocated to the instruction and question components of the prompt

Perplexity quantifies how well a model predicts a sample, i.e., how likely the model is to generate that sequence based on the probabilities it learned during training.

  • Low perplexity: The model predicts the sequence with high accuracy, suggesting the sequence is consistent with what the model 'expects', based on its training.
  • High perplexity: The sequence is less predictable. The model is more 'surprised' by it.
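
Here's a minimal sketch of that selection step, using GPT-2 as a stand-in for the small model. The perplexity calculation is standard; the selection rule and budget handling are simplified versions of what the article describes above, and the demo strings are placeholders.

```python
# A minimal sketch of perplexity-based demo selection, using GPT-2 as a
# stand-in for the small model. The selection rule follows the article's
# description (prefer lower-perplexity demos under a token budget); the
# paper's budget controller adds more bookkeeping on top of this.
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(average negative log-likelihood the model assigns to the tokens)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean cross-entropy loss.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def select_demos(demos: list[str], token_budget: int) -> list[str]:
    """Greedily keep demos, lowest perplexity first, until the token budget is spent."""
    kept, used = [], 0
    for demo in sorted(demos, key=perplexity):
        n_tokens = len(tokenizer(demo).input_ids)
        if used + n_tokens > token_budget:
            break
        kept.append(demo)
        used += n_tokens
    return kept

demos = [
    "Q: 2 + 2? A: 4.",
    "Q: A train travels 60 miles per hour. How far does it go in 3 hours? A: 180 miles.",
]
print(select_demos(demos, token_budget=30))
```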

2. Token-level prompt compression algorithm

At this point, we've discarded the high-perplexity demos and compressed the others.

Now we move from sentence-level compression to token-level compression.

Image laying out the many steps in the algorithm that the token-level prompt compressor runs

Here are the steps:

  1. Start with an empty set T (which will later be filled with tokens)
  2. Take the prompt, now made up of the original instructions and question along with the compressed demos, and divide it into segments
  3. Iterate over the segments and calculate the conditional probabilities of the tokens within each one
  4. If the conditional probability of a token exceeds a certain threshold (i.e., the small model finds it highly predictable), the token is dropped. The compressed segment is then added to set T. This removes tokens that contribute less to the overall meaning of the prompt.
  5. This process continues until all segments have been evaluated
  6. The final prompt is assembled by combining all the compressed segments (a simplified code sketch follows)
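
To make step 4 concrete, here's a simplified, single-segment sketch, again with GPT-2 standing in for the small model. The real algorithm iterates segment by segment and conditions each segment on the text already compressed; this version skips that iteration, and the probability threshold is an arbitrary placeholder.

```python
# A simplified, single-segment sketch of token-level compression: score every
# token with the small model and drop the ones it finds highly predictable.
# GPT-2 stands in for the small model; the 0.3 threshold is a placeholder.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def compress_segment(text: str, drop_if_prob_above: float = 0.3) -> str:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits            # [1, seq_len, vocab_size]
    probs = torch.softmax(logits, dim=-1)
    kept = [ids[0, 0].item()]                 # keep the first token (no left context to score it)
    for pos in range(1, ids.shape[1]):
        token_id = ids[0, pos].item()
        # Probability the model assigned to this token given the preceding tokens.
        p = probs[0, pos - 1, token_id].item()
        if p <= drop_if_prob_above:           # surprising tokens carry more information; keep them
            kept.append(token_id)
    return tokenizer.decode(kept)

print(compress_segment("The quick brown fox jumps over the lazy dog."))
```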

3. Instruction tuning for model alignment

The final step ensures that the smaller model used in the budget controller and token-level compressor aligns with the larger model that will process the final, compressed prompt.

This step essentially involves transferring knowledge from a larger, more capable model to a smaller, more efficient one, ensuring the smaller model performs tasks similarly to the larger one.

Benefits of better alignment:

  • Consistency in Probability Estimation: The small model estimates the conditional probabilities of token segments during compression. Aligning it with the behavior of the larger model ensures more accurate probability estimates.
  • Improved Compression Decisions: The compression algorithm uses probability estimates to decide which text parts to compress. An aligned small model will make more effective decisions, preserving the semantic integrity of the compressed text.
  • Transfer of Knowledge: Larger models have a broader knowledge base; aligning the small model with the large one transfers some of that understanding (a rough sketch of the fine-tuning loop follows below).
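
In practice, this looks like the standard instruction-tuning pattern: collect instruction/response pairs generated by the large black-box model, then fine-tune the small model on them with an ordinary next-token-prediction loss. The sketch below shows only that generic pattern; it is not the exact recipe, data, or hyperparameters from the paper, and the training pair is a placeholder.

```python
# A generic instruction-tuning sketch (not the paper's exact recipe): fine-tune
# a small model on instruction/response pairs produced by the large model, so
# the small model's token probabilities better match the big model's behavior.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # stand-in small model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = AdamW(model.parameters(), lr=5e-5)

# Pairs would come from prompting the large black-box model; this one is a placeholder.
training_pairs = [
    ("Summarize: The meeting covered Q3 revenue and hiring plans.",
     "The meeting discussed Q3 revenue and hiring plans."),
]

model.train()
for instruction, response in training_pairs:
    text = f"{instruction}\n{response}{tokenizer.eos_token}"
    batch = tokenizer(text, return_tensors="pt")
    # Standard causal-LM loss: predict every next token in the concatenated text.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```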

Experiment Setup

LLMLingua was put to the test across various datasets.

Datasets

  1. GSM8K: A dataset focused on mathematical reasoning, testing LLMLingua's ability to compress prompts in domains requiring  logical and numerical understanding.
  2. BBH (Big-Bench Hard): A dataset that includes tasks that require complex reasoning, testing LLMLingua in contexts that demand high cognitive capabilities.
  3. ShareGPT: This dataset is centered around conversational tasks, evaluating LLMLingua's performance in compressing prompts for dialogue-based scenarios.
  4. Arxiv-March 2023: A summarization dataset derived from scientific articles, this set tests LLMLingua's ability to effectively condense and convey information.

Evaluations

For GSM8K and BBH: Exact match was used as the primary evaluation metric. This measures whether the generated response exactly matches the expected answer.

For ShareGPT and Arxiv-March23: A combination of BLEU, ROUGE, and BERTScore metrics was used to assess the quality of outputs in comparison to human-generated texts.
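
For reference, here's roughly how these metrics can be computed with the Hugging Face evaluate library. This is just one convenient tooling choice, not necessarily what the paper's authors used, and the prediction/reference strings are placeholders.

```python
# Computing the evaluation metrics mentioned above with Hugging Face `evaluate`
# (pip install evaluate rouge_score bert_score). The strings are placeholders.
import evaluate

predictions = ["the cat sat on the mat"]           # model outputs
references = ["a cat was sitting on the mat"]      # ground-truth texts

# Exact match: 1 if the strings are identical after trimming whitespace, else 0.
exact_match = [int(p.strip() == r.strip()) for p, r in zip(predictions, references)]

bleu = evaluate.load("bleu").compute(
    predictions=predictions, references=[[r] for r in references]
)
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
bertscore = evaluate.load("bertscore").compute(
    predictions=predictions, references=references, lang="en"
)

print(exact_match, bleu["bleu"], rouge["rougeL"], bertscore["f1"])
```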

Implementation details

  • Models used: GPT-3.5-Turbo-0301 and Claude-v1.3
  • Small model used for compression: Alpaca-7B or GPT2-Alpaca

Baselines

The team compared LLMLingua against several baseline methods.

  1. GPT4-Generation: Directly instructing GPT-4  to compress the original prompt
  2. Random Selection: Randomly selects which elements of the original prompt to compress  
  3. Selective-Context: Utilizing phrase-level self-information from a small language model, this method filters out less informative content from the prompt. It aims to retain the most informative or critical parts while removing less significant content (more info here).

Experiment results

Let’s start with the ShareGPT and Arxiv-March23 datasets.

table of results from the ShareGPT and Arxiv-March23 dataset for LLMLingua

Sentence Selection = Random Selection baseline

“2x constraint” = Compress the original prompt to half the size

“3x constraint” = Compress the original prompt to a third of the size

Now for some takeaways:

  • LLMLingua achieved acceleration ratios of 9x and 3.3x (the end-to-end process was roughly 9 and 3.3 times faster)
  • High BS F1 scores indicate successful retention of semantic info from the original prompts
  • Random sentence selection ("Sentence Selection" in the table) outperformed LLMLingua twice and was relatively close in performance in many other cases
  • Under the 2x constraint, all three baselines perform similarly, with an average difference of about 4%. This suggests that at lower compression levels, for use cases related to comparing or summarizing texts, any of these methods could work.
  • LLMLingua is less sensitive to higher compression ratios: when moving from 2x to 4x compression, its performance decreases the least

Next up: GSM8K and BBH, the reasoning and in-context learning benchmarks.

Table of results from the GSM8K and BBH datasets

1-shot constraint = The model was given 1 example in the prompt

1/t = compression ratio

Some learnings:

  • With a 1-shot constraint, the LLMLingua-compressed prompt achieved slightly higher results than the full-shot prompt at compression ratios of 5x and 3x.
  • As compression ratios increased under half-shot and quarter-shot constraints, there was a slight decline in performance. On the GSM8K dataset, the Exact Match (EM) scores decreased by 1.44 and 1.52 points, respectively, at 14x and 20x compression ratios. That seems like a small degradation given the level of compression.
  • In contrast to the first set of results, LLMLingua easily beats the other compression baselines
  • Even at 20x compression, GSM8K EM scores remain high, dropping by less than 2 points
  • These points suggest that LLMLingua’s effectiveness varies based on the task. It appears to be very effective on reasoning tasks (GSM8K and BBH), while only being moderately better on conversational and summarization tasks (ShareGPT and Arxiv-March2023).

Don’t forget about Claude!

For “cost reasons” the researchers only tested Claude-v1.3 on the GSM8K dataset. They also buried it deep in the paper and left it off the main table of results.

Small table showing the Claude results on the GSM8K dataset
  • LLMLingua showed improvements over the simple prompt by 0.8 and 1.7 EM points at compression ratios of 5x and 14x, respectively.
  • It's worth noting that the Simple Prompt score here is higher than the Simple Prompt score in the table above this one (74.9), showing that with just a simple prompt, Claude beats out GPT-3.5-Turbo in this case.
  • Maybe we shouldn’t be surprised that Microsoft researchers buried this, but it looks like LLMLingua was more effective with Claude-v1.3 than with GPT-3.5-Turbo.

Ablation study

Now for my favorite part. The researchers tested several variants of LLMLingua to see which components contributed to the overall performance.

Table showing variants of LLMLinua and the performance

  1. LLMLingua w/o Iterative Token-level Compression: This variant performs token-level compression in a single step rather than iteratively. The EM score decreased significantly from 79.08 to 72.93, indicating that iterative compression is important.
  2. LLMLingua w/o Budget Controller: This variant applies the same compression ratio across all prompt components. The EM score dropped to 73.62, showing that dynamically allocating compression ratios to different parts of the prompt is worthwhile.
  3. LLMLingua w/o Dynamic Compression Ratio: This variant uses a static compression ratio for all components, resulting in a lower EM score of 77.26. Not a huge drop.
  4. LLMLingua w/ Random Selection in Budget Controller: Instead of selecting sentences based on perplexities and conditional probabilities, this variant selects them randomly. The EM score took a big drop to 72.78.
  5. LLMLingua w/o Distribution Alignment: By removing the distribution alignment component, the model directly uses the pre-trained LLaMA-7B small language model. The slight decrease in the EM score to 78.62 indicates that the alignment process may not be critical.
  6. LLMLingua w/ Remove Stop Words: Removes stop words from the original prompts.

Other findings and limitations

Graph comparing compression ratios and generation token length

  • As the compression ratio increases, the length of the output decreases, though with some variance
  • This could be a good thing: it reduces the resources spent on the generation stage, which is the chief contributor to latency (see here)
  • This could be a bad thing: you may lose out on some of the good stuff!

LLMLingua has its limitations and reaches a compression plateau.

Chart comparing compression ratio and exact match

  • There is a big performance drop when reaching very high compression ratios
  • LLMLingua’s ("Ours" in the chart) drop occurs at comparatively higher compression ratios

Wrapping up

Let’s finish with an example.

Say you have a prompt that is 2,000 tokens long, you're using GPT-4 which currently costs $0.03/1,000 prompt tokens, and you have 2,000 requests/month. Using LLMLingua, let’s compress it by 10x.

Initial prompt

  • Length: 2,000 tokens
  • Cost: 2,000 tokens * $0.03 per 1,000 tokens * 2,000 requests/month = $120.00/month

Compressed prompt - 10x compression

  • Length: 200 tokens
  • Cost: 200 tokens * $0.03 per 1,000 tokens * 2,000 requests/month = $12.00/month

That’s a 10x reduction in cost! Of course, you’d need to ensure performance is stable. You may need to reduce the compression rate, or maybe you can go even higher.
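
If you want to run the same arithmetic for your own workload, a tiny helper does it (note this counts prompt tokens only; completion tokens are billed separately):

```python
# Back-of-the-envelope prompt-cost calculator for the example above.
def monthly_prompt_cost(prompt_tokens: int, requests_per_month: int,
                        price_per_1k_tokens: float = 0.03) -> float:
    return prompt_tokens / 1000 * price_per_1k_tokens * requests_per_month

original = monthly_prompt_cost(2000, 2000)     # $120.00
compressed = monthly_prompt_cost(200, 2000)    # $12.00
print(original, compressed, original / compressed)  # 120.0 12.0 10.0
```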

LLMLingua can have a huge impact for anyone building prompts into production, but there are several nuances that come along with it. We’ve taken the time to iron out these nuances (what if your prompt doesn’t have clear distinctions between instructions and questions?) and are launching these compression capabilities directly into PromptHub. Right now it is early access only, so reach out if you're interested!

Dan Cleary
Founder