Now that prompt engineering has had time to develop, we've started to learn what works and what doesn’t. Some prompting best practices have emerged, like chain-of-thought reasoning and few-shot learning.

On a more granular level, specific instruction methods and phrases have developed as well, like EmotionPrompt, "According to" prompting, and reasoning phrases like "Take a deep breath."

As always, we are here to help you achieve better outputs from LLMs, so let's dive into a recent paper that has gained some popularity in the mainstream: Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4.

Let’s go a bit deeper than just looking at the list that the researchers put together.

Word choice matters

It's worth repeating because of how true it is: specific word choice plays a huge role in prompt engineering. Adding emotional stimuli like "This is very important to my career," or reasoning language like "take a deep breath and work on this problem step-by-step" has been shown to increase accuracy by 20% in some cases.
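To make this concrete, here is a minimal sketch of appending these "stimulus" phrases to a base prompt. The helper function and phrase dictionary are our illustration, not part of the paper:

```python
# Minimal sketch: appending stimulus phrases from the literature to a base prompt.
# The phrase list and helper names are illustrative, not from the paper.

STIMULI = {
    "emotional": "This is very important to my career.",
    "reasoning": "Take a deep breath and work on this problem step-by-step.",
}

def with_stimulus(base_prompt: str, kind: str) -> str:
    """Return the prompt with the chosen stimulus phrase appended."""
    return f"{base_prompt} {STIMULI[kind]}"

prompt = with_stimulus("Summarize the quarterly report in three bullet points.", "reasoning")
print(prompt)
```

The point is less the code than the habit: treat these phrases as reusable components you can toggle on and off while testing prompts.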

Let’s jump in and take a look at the 26 design principles that the researchers tested.

Prompt design principles

The principles are broken down into 5 categories: Prompt Structure and Clarity; Specificity and Information; User Interaction and Engagement; Content and Language Style; and Complex Tasks and Coding Prompts.

In general, these principles are designed to be:

  • Concise and clear
  • Contextually relevant
  • Aligned with the task
  • Accompanied by example demonstrations
  • Free from bias

We combined these principles and their performance results into a single table.

The table below is for GPT-4 specifically. If you want to see the performance metrics for GPT-3.5 and access the Google Sheet, join our newsletter and you'll get it in your inbox.

We'll dive deeper into how evaluations were performed further down, but for now:

Improvement %: How much the output improved compared to the baseline, based on human ratings.

Correctness %: How much more often the outputs were deemed accurate, relevant, and free of errors.

GPT-4 performance improvements by principle

| Category | # | Principle | Improvement % | Correctness % |
| --- | --- | --- | --- | --- |
| Content and Language Style | 1 | No need to be polite with the LLM, so there is no need to add phrases like "please", "if you don't mind", "thank you", "I would like to", etc.; get straight to the point. | 5.0 | 66.7 |
| Prompt Structure and Clarity | 2 | Integrate the intended audience in the prompt. | 100.0 | 86.7 |
| Complex Tasks and Coding Prompts | 3 | Break down complex tasks into a sequence of simpler prompts in an interactive conversation. | 55.0 | 86.7 |
| Prompt Structure and Clarity | 4 | Employ affirmative directives such as "do" while steering clear of negative language like "don't". | 55.0 | 66.7 |
| Specificity and Information | 5 | When you need clarity or a deeper understanding of a topic, idea, or any piece of information, use prompts like: "Explain [insert specific topic] in simple terms." "Explain to me like I'm 11 years old." "Explain to me as if I'm a beginner in [field]." | 85.0 | 73.3 |
| Content and Language Style | 6 | Add "I'm going to tip $xxx for a better solution!" | 45.0 | 86.7 |
| Specificity and Information | 7 | Implement example-driven prompting (use few-shot prompting). | 60.0 | 60.0 |
| Prompt Structure and Clarity | 8 | When formatting your prompt, start with "###Instruction###", followed by either "###Example###" or "###Question###". Use one or more line breaks to separate instructions, examples, questions, context, and input data. | 30.0 | 86.7 |
| Content and Language Style | 9 | Incorporate the phrases "Your task is" and "You MUST." | 75.0 | 80.0 |
| Content and Language Style | 10 | Incorporate the phrase "You will be penalized." | 45.0 | 86.7 |
| Content and Language Style | 11 | Use the phrase "Answer a question given in natural language form." | 40.0 | 80.0 |
| Prompt Structure and Clarity | 12 | Use leading words like "think step by step". | 50.0 | 86.7 |
| Specificity and Information | 13 | Add the phrase "Ensure that your answer is unbiased and doesn't rely on stereotypes." | 40.0 | 66.7 |
| User Interaction and Engagement | 14 | Allow the model to elicit precise details and requirements from you by asking you questions until it has enough information to provide the needed output: "From now on, I would like you to ask me questions to...". | 100.0 | — |
| Specificity and Information | 15 | To inquire about a specific topic or idea and test your understanding, use: "Teach me the [theorem/topic/rule name] and include a test at the end, but don't give me the answers, and then tell me if I got the answer right when I respond." | 80.0 | — |
| Content and Language Style | 16 | Assign a role to the language model. | 60.0 | 86.7 |
| Prompt Structure and Clarity | 17 | Use delimiters. | 35.0 | 93.3 |
| Content and Language Style | 18 | Repeat a specific word or phrase multiple times within a prompt. | 40.0 | 80.0 |
| Complex Tasks and Coding Prompts | 19 | Combine chain-of-thought (CoT) with few-shot prompts. | 15.0 | 73.3 |
| Prompt Structure and Clarity | 20 | Use output primers, which involve concluding your prompt with the beginning of the desired output. | 75.0 | 80.0 |
| User Interaction and Engagement | 21 | To write a detailed essay, text, paragraph, or article: "Write a detailed [essay/text/paragraph] for me on [topic] in detail by adding all the information necessary." | 60.0 | — |
| Content and Language Style | 22 | To correct/change specific text without changing its style: "Try to revise every paragraph sent by users. You should only improve the user's grammar and vocabulary and make sure it sounds natural. You should not change the writing style, such as making a formal paragraph casual." | 25.0 | — |
| Complex Tasks and Coding Prompts | 23 | For a complex coding prompt that may span different files: "From now on, whenever you generate code that spans more than one file, generate a [programming language] script that can be run to automatically create the specified files or make changes to existing files to insert the generated code. [your question]." | 55.0 | — |
| Specificity and Information | 24 | To initiate or continue a text using specific words, phrases, or sentences: "I'm providing you with the beginning [song lyrics/story/paragraph/essay...]: [insert lyrics/words/sentence]. Finish it based on the words provided. Keep the flow consistent." | 85.0 | 73.3 |
| Specificity and Information | 25 | Clearly state the requirements that the model must follow in order to produce content, in the form of keywords, regulations, hints, or instructions. | 85.0 | 80.0 |
| Specificity and Information | 26 | To write any text intended to be similar to a provided sample, include specific instructions: "Please use the same language based on the provided [paragraph/title/text/essay/answer]." | 100.0 | 73.3 |
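Several of the structural principles can be applied mechanically when assembling prompts. Here is a minimal sketch of principle 8 (###Header### sections separated by line breaks); the template function and its names are ours, not the paper's:

```python
# Sketch of principle 8: label prompt sections with ###Headers### and
# separate them with blank lines. The builder function is illustrative.

def build_prompt(instruction: str, example: str = "", question: str = "") -> str:
    """Assemble a prompt from labeled, line-break-separated sections."""
    parts = [f"###Instruction###\n{instruction}"]
    if example:
        parts.append(f"###Example###\n{example}")
    if question:
        parts.append(f"###Question###\n{question}")
    return "\n\n".join(parts)

print(build_prompt(
    instruction="Classify the sentiment of the question as positive or negative.",
    example="'I love this phone' -> positive",
    question="Why does my battery drain so fast?",
))
```

A builder like this also makes it easy to A/B test a principle: generate the prompt with and without a section and compare outputs.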

Want to see the performance metrics for GPT-3.5 or get direct access to the data via a Google Sheet? Join our email newsletter and you'll get it in your inbox right away.

Our top 4 principles

We looked at all the principles and their data; here are four of our favorites.

Telling the model what to do, rather than what not to do (principle 4), was mentioned in OpenAI's first best-practices documentation. Interestingly, it isn't present in the most recently published best practices.

Best practices are best practices for a reason. Chain-of-thought reasoning (principle 12) helps models produce better outputs.
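The table above also shows chain-of-thought paired with few-shot examples (principle 19). As a rough sketch of what such a prompt looks like (the worked example below is ours, not from the paper):

```python
# Sketch of combining a few-shot example with a chain-of-thought cue
# (principles 12 and 19). The example content is illustrative.

few_shot = (
    "Q: A bat and a ball cost $1.10 total. The bat costs $1.00 more than "
    "the ball. How much is the ball?\n"
    "A: Let the ball cost x. Then the bat costs x + 1.00, and "
    "2x + 1.00 = 1.10, so x = 0.05. The ball costs $0.05."
)

question = "Q: If 3 pencils cost $0.45, how much do 7 pencils cost?"

# The trailing cue primes the model to reason before answering.
prompt = f"{few_shot}\n\n{question}\nA: Let's think step by step."
print(prompt)
```

The demonstration shows the *style* of reasoning you want, and the trailing cue asks the model to reproduce it on the new question.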

Helping the model help you (principle 14) is a great way to accomplish a task. This approach is heavily backed by research (Eliciting Human Preferences with Language Models), and it is the method behind one of the more popular custom GPTs, Professor Synapse.

The best advice often needs to be repeated.

In our first blog post, 10 Best Practices for Prompt Engineering with Any Model, we mentioned that using delimiters (principle 17), like triple quotes ("""), can help the model better understand the distinct parts of your prompt.

For some concrete examples, you can see how delimiters are used in prompts by top AI companies like OpenAI, TLDraw, and Vercel here: What We Can Learn from OpenAI, Perplexity, TLDraw, and Vercel's System Prompts
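As a small sketch of the delimiter idea, here is one way to fence off free-form input from the instruction itself (the article text and wording are ours):

```python
# Sketch of principle 17: wrap free-form or untrusted input in delimiters
# (triple quotes here) so the model can tell instructions from data.

article = 'The launch was delayed again. "We need more time," the CEO said.'

prompt = (
    "Summarize the article enclosed in triple quotes in one sentence.\n\n"
    f'"""\n{article}\n"""'
)
print(prompt)
```

Delimiters also reduce the chance that instruction-like text inside the input (quotes, imperatives) gets interpreted as part of your instructions.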


The researchers tested the 26 principles on the ATLAS dataset, which contains 20 human-selected questions for each principle. The baseline for each question was a manually written prompt without the principle applied.

Models and Metrics

  • Instruction fine-tuned LLaMA-1-7B and LLaMA-1-13B
  • LLaMA-2-7B and LLaMA-2-13B
  • Off-the-shelf LLaMA-2-70B-chat
  • GPT-3.5
  • GPT-4

The models were grouped based on size:

  • Small-scale: 7B models
  • Medium-scale: 13B models
  • Large-scale: 70B models, along with GPT-3.5 and GPT-4


The principles were evaluated on two metrics, “boosting” and “correctness”.

Boosting: Humans assessed the quality of the response before and after applying the principle.

Correctness: Humans determined whether the outputs were accurate, relevant, and free of errors.
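To make the two metrics concrete, here is one plausible way they could be computed from paired human judgments. These formulas are our illustration of the setup, not the paper's exact definitions:

```python
# Illustrative computation of the two metrics from human judgments.
# These formulas are our reading of the setup, not the paper's definitions.

def boosting_pct(before: list, after: list) -> float:
    """Share of questions where the principled prompt was rated higher."""
    improved = sum(b < a for b, a in zip(before, after))
    return 100 * improved / len(before)

def correctness_pct(is_correct: list) -> float:
    """Share of outputs judged accurate, relevant, and error-free."""
    return 100 * sum(is_correct) / len(is_correct)

# Hypothetical ratings for 4 questions, before and after applying a principle.
print(boosting_pct([2, 3, 1, 4], [4, 3, 3, 5]))
print(correctness_pct([True, True, False, True]))
```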

Before we look at the results, here are a few examples from the paper. While it provides good insights, I believe some of the results are inflated due to weak baseline prompts. It's not egregious, but it is worth noting.

[Image] Example 1, using principle 13

[Image] Example 2, using principles 5 and 6

[Image] Correctness improvement example using principle 7

[Image] Correctness improvement example

[Image] Correctness improvement example using principle 25

Experiment results

Before we look at some graphs, here are some high level metrics:

Boosting: There was a consistent 50% improvement in responses across all LLMs tested.

Correctness: There was an average 20% increase in accuracy across all small-scale models, and a 50% increase for larger models.

Bar chart showing the average improvement percentages per principle

  • As a quick example to better understand the graph: a 100% improvement (principle 14) means responses were twice as good when the principle was used.
  • On average, larger models tend to show greater improvements in response quality.

Bar chart showing the average correctness percentages per principle

  • A quick example to better understand the graph: a 65% improvement (principle 3) means responses were 65% more accurate than those from the prompt without the principle applied.
  • We see larger models reaping more of the rewards here. Chalk that up to larger models having far more parameters, which makes contextual understanding and comprehension much easier.

LLM Breakdown

Box plot graph showing the improvement scores across different models

  • There is significant variability in improvement percentages across all models
  • The median improvement scores (represented by the black line in the colored boxes) are relatively consistent across models
  • There's a notable consistency in the interquartile range across models, which implies that the overall impact of optimizations has a somewhat predictable range of effect across different model sizes.

Box plot graph showing the correctness scores across different models

  • Median correctness scores increase with the model size
  • GPT-4 outperformed smaller models by a wide margin

Heatmap showing the improvement percentages

  • Principles 14, 24, and 26 are particularly effective across most models
  • On average, GPT-3.5 and GPT-4 show the greatest improvement

Heatmap showing the correctness percentages

  • GPT-4 shows the greatest gains in performance
  • Principles 12, 18, and 24 seem to be effective across all models

Wrapping up

While some of these principles may not apply to your use case, they are valuable in that they give you a clear set of techniques to try out. I would suggest starting by understanding where your prompts are currently struggling and identifying the related category. From there, check out the performance metrics (access them in full via our newsletter above), and start with the highest-leverage principle.

Hopefully this helps you get better outputs!

Dan Cleary