Now that prompt engineering has had time to develop, we've started to learn what works and what doesn’t. Some prompting best practices have emerged, like chain-of-thought reasoning and few-shot prompting.
On a more granular level, specific instruction methods and phrases have developed as well, like EmotionPrompt, "According to" prompting, and reasoning phrases like "Take a deep breath.”
Let’s go a bit deeper than just looking at the list that the researchers put together.
Word choice matters
It's worth repeating because of how true it is: specific word choice plays a huge role in prompt engineering. Adding emotional stimuli like "This is very important to my career," or reasoning language like "take a deep breath and work on this problem step-by-step" has been shown to increase accuracy by 20% in some cases.
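To make that concrete, here's a minimal sketch of appending one of these stimulus phrases to an otherwise unchanged prompt. It assumes the openai Python SDK (v1+) with an API key in the environment; the model name and the task text are placeholders, not anything from the study.

```python
# Minimal sketch: append an emotional-stimulus / reasoning phrase to a baseline prompt.
# Assumes the openai Python SDK (v1+) and OPENAI_API_KEY set in the environment;
# the model name and the task itself are placeholders.
from openai import OpenAI

client = OpenAI()

baseline = "Summarize the key findings of the attached report in three bullet points."
stimulus = (
    " This is very important to my career. "
    "Take a deep breath and work on this problem step-by-step."
)

for prompt in (baseline, baseline + stimulus):
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content, "\n---")
```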
Let’s jump in and take a look at the 26 design principles that the researchers tested.
Prompt design principles
The principles are broken down into 5 categories: Prompt Structure and Clarity, Specificity and Information, User Interaction and Engagement, Content and Language Style, Complex Tasks and Coding Prompts.
In general, these principles are designed to be:
Concise and clear
Contextually relevant
Aligned with the task
Accompanied by example demonstrations
Free from bias
We combined these principles and their performance-improvement results into a single table.
The table below is for GPT-4 specifically. If you want to see the performance metrics for GPT-3.5 and access the Google Sheet, join our newsletter and you'll get it in your inbox.
We'll dive deeper into how evaluations were performed further down, but for now:
Improvement %: By how much the output improved, compared to the baseline, based on human ratings
Correctness %: How much more often the outputs were deemed accurate, relevant, and free of errors
GPT-4 performance improvements by principle
| Principle Category | # | Principle | Improvement % | Correctness % |
| --- | --- | --- | --- | --- |
| Content and Language Style | 1 | No need to be polite with LLMs, so there is no need to add phrases like "please", "if you don't mind", "thank you", "I would like to", etc.; get straight to the point. | 5.0 | 66.7 |
| Prompt Structure and Clarity | 2 | Integrate the intended audience in the prompt. | 100.0 | 86.7 |
| Complex Tasks and Coding Prompts | 3 | Break down complex tasks into a sequence of simpler prompts in an interactive conversation. | 55.0 | 86.7 |
| Prompt Structure and Clarity | 4 | Employ affirmative directives such as "do" while steering clear of negative language like "don't". | 55.0 | 66.7 |
| Specificity and Information | 5 | When you need clarity or a deeper understanding of a topic, idea, or any piece of information, use prompts such as: "Explain [insert specific topic] in simple terms", "Explain to me like I'm 11 years old", "Explain to me as if I'm a beginner in [field]". | 85.0 | 73.3 |
| Content and Language Style | 6 | Add "I'm going to tip $xxx for a better solution!" | N/A | N/A |
| Prompt Structure and Clarity | 8 | When formatting your prompt, start with "###Instruction###", followed by either "###Example###" or "###Question###". Use one or more line breaks to separate instructions, examples, questions, context, and input data. | 30.0 | 86.7 |
| Content and Language Style | 9 | Incorporate the phrases "Your task is" and "You MUST". | 75.0 | 80.0 |
| Content and Language Style | 10 | Incorporate the phrase "You will be penalized". | 45.0 | 86.7 |
| Content and Language Style | 11 | Use the phrase "Answer a question given in natural language form". | 40.0 | 80.0 |
| Prompt Structure and Clarity | 12 | Use leading words like "think step by step". | 50.0 | 86.7 |
| Specificity and Information | 13 | Add to your prompt the phrase "Ensure that your answer is unbiased and doesn't rely on stereotypes." | 40.0 | 66.7 |
| User Interaction and Engagement | 14 | Allow the model to elicit precise details and requirements from you by asking questions until it has enough information to provide the needed output, e.g., "From now on, I would like you to ask me questions to...". | 100.0 | N/A |
| Specificity and Information | 15 | To inquire about a specific topic or idea and test your understanding, use: "Teach me the [theorem/topic/rule name] and include a test at the end, but don't give me the answers and then tell me if I got the answer right when I respond". | 80.0 | N/A |
| Content and Language Style | 16 | Assign a role to the language model. | 60.0 | 86.7 |
| Prompt Structure and Clarity | 17 | Use delimiters. | 35.0 | 93.3 |
| Content and Language Style | 18 | Repeat a specific word or phrase multiple times within a prompt. | 40.0 | 80.0 |
| Complex Tasks and Coding Prompts | 19 | Combine chain-of-thought (CoT) with few-shot prompts. | 15.0 | 73.3 |
| Prompt Structure and Clarity | 20 | Use output primers: conclude your prompt with the beginning of the desired output. | 75.0 | 80.0 |
| User Interaction and Engagement | 21 | To write a detailed essay, text, paragraph, or article: "Write a detailed [essay/text/paragraph] for me on [topic] in detail by adding all the information necessary". | 60.0 | N/A |
| Content and Language Style | 22 | To correct or change specific text without changing its style: "Try to revise every paragraph sent by users. You should only improve the user's grammar and vocabulary and make sure it sounds natural. You should not change the writing style, such as making a formal paragraph casual." | 25.0 | N/A |
| Complex Tasks and Coding Prompts | 23 | When you have a complex coding prompt that may span different files: "From now on, whenever you generate code that spans more than one file, generate a [programming language] script that can be run to automatically create the specified files or make changes to existing files to insert the generated code. [your question]." | 55.0 | N/A |
| Specificity and Information | 24 | To initiate or continue a text using specific words, phrases, or sentences: "I'm providing you with the beginning [song lyrics/story/paragraph/essay...]: [insert lyrics/words/sentence]. Finish it based on the words provided. Keep the flow consistent." | 85.0 | 73.3 |
| Specificity and Information | 25 | Clearly state the requirements that the model must follow in order to produce content, in the form of keywords, regulations, hints, or instructions. | 85.0 | 80.0 |
| Specificity and Information | 26 | To write any text intended to be similar to a provided sample: "Please use the same language based on the provided [paragraph/title/text/essay/answer]". | 100.0 | 73.3 |
Want to see the performance metrics for GPT-3.5 or get direct access to the data via a Google Sheet? Join our email newsletter and you'll get it in your inbox right away.
Our top 4 principles
We looked at all the principles and their data; here are four of our favorites.
Hey guys, how's it going? Dan here from PromptHub. We have some super actionable tips for you today, focusing on principles that you can take into 2024 to add to your prompt testing toolkit. Looking back briefly, you know 2023 was obviously a huge year for AI in general with the launch of ChatGPT. It was actually at the end of 2022, but it was really in 2023 where things started to mature. We're seeing a lot of best practices develop in prompt engineering from few-shot learning to Chain of Thought reasoning. We're learning a lot more about how to get better responses from LLMs using specific instruction methods like emotion prompts, "take a deep breath" prompts, and Chain of Thought reasoning. There's a lot that has emerged over the last year.
Just at the end of last year, there was a very interesting paper titled "Principled Instructions Are All You Need," which basically provided a mega list of 26 prompting principles that these researchers decided to test to see how they impacted output quality and correctness. We dug deep into it to find which principles were best, analyzing how the experiments were run, the datasets used, and everything else. We're here to bring that to you today.
In general, they broke down the principles into five different categories. I find these categories helpful as a starting place to orient yourself based on your use case. For example, if you're doing content creation, the specificity and information category might be the one to look at, and so on. The list in the research paper contains 26 principles, and they later show how much each principle affected the baseline prompt. We combined those into a single table with the category, principle, and improvement and correctness percentage increases. That's available via our Substack. If you drop your email in, you'll get a link to a Google Sheet that has all of that for GPT-3.5 and 4, broken down by model to account for the different percentage increases they found through the experiments. That's free to access through our Substack.
We're going to look at a few examples. We're not going to run through all 26, but we'll check out a few that stood out to us. The first is principle 4: "Tell the model what to do, not what not to do." This was something OpenAI acknowledged in their first round of prompt engineering principles, but it wasn't listed in their most recent update of best practices. Based on our experience working with teams at PromptHub, the more specific you can be about what you actually want the model to do, rather than cramming in "don't do this" and "don't do that," the better the performance we've seen. That said, negative language can still be helpful for steering the model away from certain behaviors, so it cuts both ways.
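As a quick illustration, here's a hypothetical before/after pair (our own example, not one of the paper's prompts) showing the same instruction phrased negatively versus affirmatively:

```python
# Hypothetical before/after pair for principle 4: affirmative directives.
negative_prompt = (
    "Summarize this support ticket. Don't be vague, don't include internal jargon, "
    "and don't exceed five sentences."
)
affirmative_prompt = (
    "Summarize this support ticket. Be specific, use plain customer-facing language, "
    "and keep the summary to at most five sentences."
)
```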
Principle 12: Chain of Thought reasoning has been around for a while, and if you're doing any sort of logical reasoning or complex task, instructing the model to think step by step, take a deep breath, or print out its thoughts in thought tags (which you can later strip out from the response) usually helps it get to a better end result.
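For example, here's a minimal sketch (our own, not the paper's setup; the tag name is arbitrary) of asking for step-by-step reasoning inside tags and stripping it out before showing the final answer:

```python
import re

# Sketch: ask the model to reason step by step inside <thought> tags,
# then strip those tags out before displaying the final answer.
cot_prompt = (
    "Take a deep breath and work on this problem step by step. "
    "Write your reasoning inside <thought>...</thought> tags, "
    "then give the final answer on its own line.\n\n"
    "Question: A train travels 120 km in 1.5 hours. What is its average speed?"
)

def strip_thoughts(model_output: str) -> str:
    """Remove <thought>...</thought> blocks, leaving only the final answer."""
    return re.sub(r"<thought>.*?</thought>", "", model_output, flags=re.DOTALL).strip()

example_output = "<thought>120 / 1.5 = 80</thought>\nThe average speed is 80 km/h."
print(strip_thoughts(example_output))  # -> "The average speed is 80 km/h."
```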
Another interesting principle is allowing the model to help you by asking questions to get precise details and requirements. This is tailored more towards chat or conversational experiences but is really about building up proper context for the model. This can lead to better results as the model can keep asking questions until it has all it needs to complete the task. This is especially helpful if you're building any sort of chat experience.
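In a chat setting, that could look something like the sketch below, using the openai Python SDK; the system prompt wording and the "READY" convention are our own assumptions, not from the paper:

```python
# Sketch of principle 14 in a chat loop: the model keeps asking clarifying
# questions until it signals it has enough context. Assumes the openai SDK (v1+);
# the "READY" convention and system prompt are illustrative only.
from openai import OpenAI

client = OpenAI()
messages = [
    {"role": "system", "content": (
        "From now on, ask me one question at a time until you have enough "
        "information to write the landing-page copy I need. When you have "
        "enough, reply with READY followed by the copy."
    )},
    {"role": "user", "content": "I need landing-page copy for my product."},
]

while True:
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    content = reply.choices[0].message.content
    print("Assistant:", content)
    if content.strip().startswith("READY"):
        break
    messages.append({"role": "assistant", "content": content})
    messages.append({"role": "user", "content": input("You: ")})
```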
The last principle we'll look at is using delimiters and breaking up your prompt to be more structured. This is not only a great way to get better outputs but also makes it easier for your team and whoever you're working with to understand what the prompt is doing at a glance. Breaking it up by instructions, demonstrations, and specific questions is good prompt hygiene and leads to better outputs. We have a recent blog post with examples of this from AI companies like OpenAI and Perplexity, which I'll link below.
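Here's a rough template (our own example) showing the kind of delimiter-based structure principles 8 and 17 describe:

```python
# Sketch of a delimiter-structured prompt (principles 8 and 17).
# The sections and example content are illustrative only.
structured_prompt = """###Instruction###
Classify the customer message below as "billing", "technical", or "other".
Answer with the label only.

###Example###
Message: "I was charged twice this month."
Label: billing

###Question###
Message: "The app crashes every time I open settings."
Label:"""
```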
The study took a baseline prompt from the dataset, ran it, then added the principle and saw how it performed. The baseline prompts are quite thin in some cases, so it's almost a no-brainer that the principled version will do better. Here's another example using few-shot learning for counting words. These models have not been great at math, so in cases where you're doing math, few-shot and Chain of Thought reasoning go a long way.
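A few-shot counting prompt along those lines (ours, not the paper's baseline) might look like this:

```python
# Sketch of combining few-shot examples with step-by-step reasoning
# for a word-counting task (illustrative only).
few_shot_counting_prompt = """Count the words in each sentence. Think step by step.

Sentence: "The cat sat."
Reasoning: The -> 1, cat -> 2, sat -> 3.
Word count: 3

Sentence: "Prompt engineering is evolving quickly."
Reasoning: Prompt -> 1, engineering -> 2, is -> 3, evolving -> 4, quickly -> 5.
Word count: 5

Sentence: "These models have historically struggled with counting."
Reasoning:"""
```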
The experiments were set up to judge two metrics: boosting (the quality of the response) and correctness (accuracy, relevance, and freedom from errors). They judged before and after responses on these metrics. Overall, there was roughly a 50% improvement in boosting, and correctness gains ranging from about 20% on smaller models up to 50% on the larger ones. They tested a wide range of models, and the bigger improvements show up for larger models like GPT-3.5 and GPT-4.
Here are the models they looked at, and the larger the model, the bigger the improvement the principles have on average. The heat map shows principle 14 performing well across the board. GPT-4 makes the most of the principles, improving significantly with just a bit of prompt engineering.
Now that you're a bit more oriented, you can see the principles, improvement percentages, correctness percentages, and categories for both 3.5 and 4. Access this from our Substack by dropping your email in, and you'll get a link to the Google Sheet. Happy prompting, and let me know if you have any questions.
Performance
The researchers tested the 26 principles on the ATLAS dataset, a manually crafted benchmark that contains 20 human-selected questions for each principle. For each question, a manually written baseline prompt was compared against the same prompt with the principle applied.
Models and Metrics
Instruction fine-tuned LLaMA-1-7B and LLaMA-1-13B
LLaMA-2-7B and LLaMA-2-13B
Off-the-shelf LLaMA-2-70B-chat
GPT-3.5
GPT-4
The models were grouped based on size:
Small-scale: 7B models
Medium-scale: 13B models
Large-scale: LLaMA-2-70B-chat, GPT-3.5, and GPT-4
Evaluations
The principles were evaluated on two metrics, "boosting" and "correctness" (a toy sketch of both follows below).
Boosting: Humans assessed the quality of the response before and after applying the principle.
Correctness: Humans judged whether the outputs were accurate, relevant, and free of errors.
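To make that concrete, here's a toy sketch of how we read the two metrics; the ratings below are made-up placeholders, not data from the paper:

```python
# Toy sketch of the two metrics, under our reading of the paper's setup.
# All ratings here are made-up placeholders for illustration only.
baseline_quality   = [2, 3, 2, 3]   # human quality ratings without the principle
principled_quality = [4, 5, 4, 5]   # human quality ratings with the principle
judged_correct = [True, False, True, True]  # accurate, relevant, error-free?

improvement = 100 * (sum(principled_quality) / sum(baseline_quality) - 1)
correctness = 100 * sum(judged_correct) / len(judged_correct)
print(f"Improvement: {improvement:.0f}%  Correctness: {correctness:.0f}%")
# A 100% improvement would mean the principled responses were rated twice as good.
```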
One note before we get to the results: while this paper provides good insights, I believe some of the results are inflated because the baseline prompts were fairly weak. It's not egregious, but it is worth noting.
Experiment results
Before we look at some graphs, here are some high level metrics:
Boosting: There was a consistent 50% improvement in responses across all LLMs tested.
Correctness: There was an average 20% increase in accuracy across all small-scale models, and a 50% increase for larger models.
As a quick example to better understand the graph, a 100% improvement (principle 14) means responses were twice as good when the principle was used.
On average, larger models tend to show greater improvements in response quality.
A quick example to better understand the graph: a 65% improvement (principle 3) means responses were 65% more accurate compared to the prompt without the principle applied.
We see larger models reaping more of the rewards here. Chalk that up to larger models having far more parameters, which makes contextual understanding and comprehension much easier.
LLM Breakdown
There is significant variability in improvement percentages across all models
The median improvement scores (represented by the black line in the colored boxes) are relatively consistent across models
There's a notable consistency in the interquartile range across models, which implies that the overall impact of optimizations has a somewhat predictable range of effect across different model sizes.
Median correctness scores increase with the model size
GPT-4 outperformed smaller models by a wide margin
Principles 14, 24, and 26 are particularly effective across most models
On average, GPT-3.5 and GPT-4 show the greatest improvement
GPT-4 shows the greatest gains in performance
Principles 12, 18, and 24 seem to be effective across all models
Wrapping up
While some of these principles may not apply to your use case, they are valuable in that they give you a clear set of techniques to try out. I would suggest starting by understanding where your prompt(s) are currently struggling and identifying the related category. From there, check out the performance metrics (access the metrics in full via our newsletter above) and start with the highest-leverage principle.
Hopefully this helps you get better outputs!
Dan Cleary
Founder
Better LLM outputs are a click away
PromptHub is a better way to test, manage, and deploy prompts for your AI products