Now that prompt engineering has had time to develop, we've started to learn what works and what doesn’t. Some prompting best practices have emerged, like chain-of-thought reasoning and few-shot learning.
As always, we are here to help you achieve better outputs from LLMs, so let's dive into a recent paper that has gained some popularity in the mainstream: Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4.
Let’s go a bit deeper than just looking at the list that the researchers put together.
Word choice matters
It's worth repeating because of how true it is: specific word choice plays a huge role in prompt engineering. Adding emotional stimuli like "This is very important to my career," or reasoning language like "take a deep breath and work on this problem step-by-step" has been shown to increase accuracy by 20% in some cases.
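If you want to try this on your own tasks, here's a minimal sketch of what bolting those phrases onto a prompt might look like. It assumes the OpenAI Python SDK (v1+) with an OPENAI_API_KEY in your environment; the model name, the task, and the exact phrasing are placeholders, not anything prescribed by the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

base_prompt = "Summarize the key risks in the attached contract."

# Hypothetical additions: an emotional stimulus plus a step-by-step reasoning cue
stimulus = "This is very important to my career."
reasoning_cue = "Take a deep breath and work on this problem step-by-step."

prompt = f"{base_prompt}\n\n{stimulus}\n{reasoning_cue}"

response = client.chat.completions.create(
    model="gpt-4",  # placeholder; swap in whichever model you're testing
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Running the same question with and without the extra sentences is the quickest way to see whether the effect shows up for your use case.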
Let’s jump in and take a look at the 26 design principles that the researchers tested.
Prompt design principles
The principles are broken down into five categories: Prompt Structure and Clarity; Specificity and Information; User Interaction and Engagement; Content and Language Style; and Complex Tasks and Coding Prompts.
In general, these principles are designed to be:
- Concise and clear
- Contextually relevant
- Aligned with the task
- Accompanied by example demonstrations
- Free from bias
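To make those criteria a bit more concrete, here's a small before/after sketch we wrote ourselves (it's not from the paper): the second prompt is concise, gives the model the relevant context, and includes an example demonstration.

```python
# A vague prompt vs. one that follows the design criteria above (illustrative only)
vague_prompt = "Tell me about this review."

structured_prompt = """Classify the sentiment of the customer review as Positive, Negative, or Mixed.

Example:
Review: "Shipping was slow, but the product works great."
Sentiment: Mixed

Review: "The battery died after two days and support never replied."
Sentiment:"""

print(structured_prompt)
```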
We combined these principles and their performance-improvement results into a single table.
The table below is for GPT-4 specifically. If you want to see the performance metrics for GPT-3.5 and access the Google Sheet, join our newsletter and you'll get it in your inbox.
We'll dive deeper into how evaluations were performed further down, but for now:
Improvement %: By how much the output improved, compared to the baseline, based on human ratings
Correctness %: How much more often the outputs were deemed accurate, relevant, and free of errors
GPT-4 performance improvements by principle
Our top 4 principles
We looked at all the principles and their data; here are four of our favorites.
Best practices are best practices for a reason. Chain-of-thought reasoning helps models produce better outputs.
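As a quick illustration (our own wording, not the paper's benchmark prompt), a chain-of-thought cue can be as simple as asking the model to reason before it answers:

```python
question = "A store sells pens in packs of 12 for $3. How much do 30 pens cost?"

# Zero-shot chain-of-thought: ask for the reasoning before the final answer
# (illustrative prompt, not taken from the paper)
cot_prompt = f"""{question}

Let's think step by step, then give the final answer on its own line."""

print(cot_prompt)
```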
Helping the model help you, for example by letting it ask clarifying questions before it answers, is a great way to accomplish a task. This approach is heavily backed by research (Eliciting Human Preferences with Language Models), and it is the method behind one of the more popular custom GPTs, Professor Synapse.
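In practice, that usually means telling the model to interview you before it responds. Here's a rough sketch of that kind of instruction; the wording is ours, loosely in the spirit of the elicitation paper and of Professor Synapse, not a quote from either.

```python
# Illustrative "elicit my preferences first" instruction (our own wording)
elicitation_prompt = """I want help planning a week of meals.

Before you produce the plan, ask me one clarifying question at a time
(dietary restrictions, budget, time to cook, etc.) until you have enough
information, then summarize my preferences and generate the plan."""

print(elicitation_prompt)
```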
The best advice often needs to be repeated.
In our first blog post, 10 Best Practices for Prompt Engineering with Any Model, we mentioned that using delimiters, like triple quotes ("""), can help the model better understand the distinct parts of your prompt.
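For instance, a delimited prompt might look like the sketch below (our own example); the triple quotes make it unambiguous where the instructions end and the source text begins.

```python
# Our own delimiter example; the article text is just a placeholder
article = "…the full article text would go here…"

delimited_prompt = f'''Summarize the text delimited by triple quotes in three bullet points.

"""
{article}
"""
'''

print(delimited_prompt)
```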
For some concrete examples, you can see how delimiters are used in prompts by top AI companies like OpenAI, TLDraw, and Vercel here: What We Can Learn from OpenAI, Perplexity, TLDraw, and Vercel's System Prompts
How the principles were tested
The researchers tested the 26 principles on the ATLAS dataset, which contains 20 human-selected questions for each principle. The baseline for comparison was a manually written prompt without the principle applied.
Models and Metrics
- Instruction fine-tuned LLaMA-1-7B and LLaMA-1-13B
- LLaMA-2-7B and LLaMA-2-13B
- Off-the-shelf LLaMA-2-70B-chat
- GPT-3.5 and GPT-4
The models were grouped based on size:
- Small-scale: 7B models
- Medium-scale: 13B models
- Large-scale: 70B models, plus GPT-3.5 and GPT-4
The principles were evaluated on two metrics, “boosting” and “correctness”.
Boosting: Humans assessed the quality of the response before and after applying the principle.
Correctness: Humans determined whether the outputs were accurate, relevant, and free of errors.
Before we look at the results, one caveat: while this paper provides good insights, I believe some of the results are inflated by a weak baseline prompt. It's not egregious, but it is worth noting.
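To illustrate what I mean, here's a hypothetical pair I put together (it is not one of the paper's actual test cases): when the baseline is this bare, almost any added structure will look like a big win.

```python
# Hypothetical baseline vs. principled prompt (not from the ATLAS dataset)
weak_baseline = "Explain photosynthesis."

principled = """Explain photosynthesis to a high-school biology student.
Cover the inputs, the outputs, and where in the cell it happens, in under 150 words."""

# Comparing outputs from these two will almost certainly favor the second prompt,
# but part of that gap comes from the baseline being vague, not from the principle itself.
print(weak_baseline)
print(principled)
```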
Before we look at some graphs, here are some high-level metrics:
Boosting: There was a consistent 50% improvement in responses across all LLMs tested.
Correctness: There was an average 20% increase in accuracy across all small-scale models, and a 50% increase for larger models.
- As a quick example to better understand the graph, a 100% improvement (principle 14) means responses were twice as good when the principle was used.
- On average, larger models tend to show greater improvements in response quality.
- Another quick example to better understand the graph: a 65% improvement (principle 3) means responses were 65% more accurate compared to the same prompt without the principle applied (there's a short numeric sketch of this arithmetic after these notes).
- We see larger models reaping more of the rewards here. Chalk that up to larger models having far more parameters, which makes contextual understanding and comprehension much easier.
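If it helps, here's a tiny numeric sketch of how to read those percentages; the scores are made up purely for illustration.

```python
# How to read the improvement percentages (made-up scores for illustration)
baseline_quality = 3.0        # e.g., a human rating of the response without the principle
improvement_pct = 100         # the reported improvement for a given principle

quality_with_principle = baseline_quality * (1 + improvement_pct / 100)
print(quality_with_principle)  # 6.0, i.e. "twice as good" as the baseline
```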
- There is significant variability in improvement percentages across all models
- The median improvement scores (represented by the black line in the colored boxes) are relatively consistent across models
- There's a notable consistency in the interquartile range across models, which implies that the overall impact of optimizations has a somewhat predictable range of effect across different model sizes.
- Median correctness scores increase with the model size
- GPT-4 outperformed smaller models by a wide margin
- Principles 14, 24, and 26 are particularly effective across most models
- On average, GPT-3.5 and GPT-4 show the greatest improvement
- GPT-4 shows the greatest gains in performance
- Principles 12, 18, and 24 seem to be effective across all models
While some of these principles may not apply to your use case, they are valuable in that they give you a clear set of techniques to try out. I would suggest starting by understanding where your prompt(s) are currently struggling and identifying the related category. From there, check out the performance metrics (access the metrics in full via our newsletter above) and start with the highest-leverage principle.
Hopefully this helps you get better outputs!