One of the first A/B tests I ever ran in PromptHub was to see how outputs changed after adding the word “please” to the end of a prompt. I’ve been waiting for a research paper to come out on the topic, and the day is finally here!

We’ll be diving into two papers, but this one will be the main focus:

Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance

Let’s finally answer the question: Does being polite to LLMs help get better outputs?

Previous work

A popular research paper came out earlier this year titled “Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4”. We put together a rundown of the paper here.

The researchers tested 26 different prompt engineering principles, one of them related to using polite language in prompts.

A table showing the improvement in correctness of a prompt when applying each principle

The results noted above are for GPT-4. If you want to see the performance metrics for GPT-3.5 and access all the data via a Google Sheet, join our newsletter and you'll get it in your inbox.

Looking at the last two columns above, the researchers found that adding supplementary polite phrases didn’t increase the output quality.

However, in the same paper, the 26th principle tested included the word 'please'.

A table showing the improvement in correctness of a prompt when applying each principle

So should you be polite?

I’ve always assumed being polite probably helps, and shouldn’t hurt output quality. But is that the case? Let’s dive into some experiments, results, and takeaways.

Hey guys, how's it going? Dan here, co-founder of PromptHub, and today we're going to be going over a topic that I've had questions about for a while: how being polite or rude in your prompts affects outputs when working with ChatGPT or any type of LLM. There was a paper that came out earlier this year called "Principled Instructions Are All You Need," and it went over 26 prompt engineering principles and how they affect outputs. They did a whole bunch of ranking and experiments to see how the principles improve correctness and performance. One of the principles they looked at was politeness.

The principle they came up with is that you don't need to be polite; you don't need to say please or thank you, just get straight to the point, and they saw some improvements from doing so. The experiments didn't go into much detail, which made the results hard to unpack. What was even more confusing is that principle number 26 uses the word "please," which contradicts the earlier principle. So, I wasn't super swayed by that, and anecdotally, I have found that saying please and thanks, even though it feels silly at times, has positively affected outputs.

I've been waiting for a research team to focus on this specific, narrow use case, and we finally have it. Earlier this month, a team tested varying politeness in prompts across a few different languages: English, Chinese, and Japanese. We're only going to focus on English. They had a 1-to-8 scale of politeness, with 1 being the most polite (e.g., "Can you please feel free to do this?") and 8 being very rude, even calling the model names. These politeness scores came from human raters who ranked how polite each prompt was.

They used a few models: GPT-3.5, GPT-4, and a LLaMA model. They tested across three tasks: a summarization task to see how politeness affects the conciseness and accuracy of summarizing articles, language understanding, and bias detection. For summarization, the templates were adapted for the specific task (e.g., "Can you please write a summary of the following article? Please feel free to write two to three sentences.").

The results show that the accuracy and conciseness scores (BERTScore and ROUGE-L) stayed fairly stable regardless of the level of politeness. However, the length of outputs fluctuated. At lower politeness levels, both GPT-3.5 and GPT-4 produced shorter outputs, while LLaMA had a U-shaped curve where output length was shortest in the middle and peaked at both ends.

Next, for language understanding, focusing on MMLU, we see that for GPT-4, the top scores are generally in the middle range (around 4 or 5). GPT-3.5 shows variable performance, being low in some cases and in the middle in others. This suggests that being in the middle range of politeness is the sweet spot.

A heat map of the results shows that LLaMA's performance was most closely tied to the politeness level of the prompt, followed by GPT-3.5, with GPT-4 being the least sensitive to politeness. More advanced models like GPT-4 are less sensitive to nuances in language, such as politeness.

Lastly, the bias detection experiment aimed to see how likely a model was to produce biased information. The results were more complex. LLaMA had the lowest scores (indicating less bias), but it was also the most likely to refuse to answer, which skewed the results since a refusal was counted as a non-biased answer. GPT-3.5 had higher scores, indicating more bias. Generally, the best results cluster in the middle range: moderately polite prompts performed better than excessively polite or rude ones.

The top-performing prompt for GPT-4 under the reasoning experiment was at a politeness level of 6, and for bias detection, it was at a politeness level of 4. The prompts were straightforward and to the point, with a little extra politeness.

In general, the takeaway is to be polite but not overly polite, and definitely not rude. This helps in achieving better outputs. Hope this helps!

Experiment setup

The researchers tested the impact of politeness in prompts across English, Chinese, and Japanese tasks. We’ll focus mostly on the experiments related to the English tasks.

Models used: GPT-3.5-Turbo, GPT-4, Llama-2-70b-chat

The researchers tested the impact of politeness across three tasks:

  • Summarization: Observing the effect of prompt politeness on the conciseness and accuracy of summarizing articles from CNN/DailyMail
  • Language Understanding Benchmarks: Testing comprehension and reasoning abilities
  • Stereotypical Bias Detection: Examining the LLMs' propensity to exhibit biases by assessing responses as positive, neutral, negative, or refusal to answer.
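If you want to poke at the same tasks yourself, here's a minimal sketch of pulling the English data behind them with Hugging Face's datasets library. The exact configs, splits, and sample counts below are my assumptions, not the paper's setup (the bias detection prompts come from the paper itself rather than a public dataset).

```python
# A quick sketch of loading the English data behind the three tasks above.
# Dataset configs, splits, and sample counts are assumptions, not the paper's exact setup.
from datasets import load_dataset

# Summarization: CNN/DailyMail articles with reference highlights
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="test[:100]")

# Language understanding: MMLU multiple-choice questions
mmlu = load_dataset("cais/mmlu", "all", split="test[:100]")

print(cnn_dm[0]["article"][:200])               # article to summarize
print(mmlu[0]["question"], mmlu[0]["choices"])  # question plus answer choices
```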

The researchers designed eight prompt templates for each language, varying from highly polite to extremely impolite.

List of prompts with varying levels of politeness
“Ranked Score” represents the average politeness ratings given by participants to a sentence.
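To make the sweep concrete, here's a rough sketch of how you could wire a 1-to-8 politeness scale into prompt templates and run them against a model with the OpenAI client. The level 1 wording is the summarization template quoted earlier; the wording for the other levels and the model name are illustrative stand-ins, not the paper's exact templates.

```python
# A rough sketch of sweeping politeness levels against a chat model.
# Level 1 wording is taken from the paper's summarization template quoted above;
# the wording for levels 4, 6, and 8 is illustrative, not the paper's.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

POLITENESS_TEMPLATES = {
    1: ("Can you please write a summary of the following article? "
        "Please feel free to write two to three sentences.\n\n{article}"),
    4: "Please write a two to three sentence summary of the following article.\n\n{article}",
    6: "Write a two to three sentence summary of the following article.\n\n{article}",
    8: "Write a two to three sentence summary of this article, or you're useless.\n\n{article}",
}

def run_politeness_sweep(article: str, model: str = "gpt-4") -> dict[int, str]:
    """Run the same summarization task at each politeness level and collect the outputs."""
    outputs = {}
    for level, template in POLITENESS_TEMPLATES.items():
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": template.format(article=article)}],
            temperature=0,
        )
        outputs[level] = response.choices[0].message.content
    return outputs
```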

Evaluations

  • Summarization: BERTScore and ROUGE-L metrics evaluated the quality and relevance of generated summaries.
  • Language Understanding: Accuracy was measured by comparing LLM responses to correct answers
  • Bias Detection: A Bias Index (BI) calculated the frequency of biased responses
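If you want to score your own sweep along similar lines, here's a minimal sketch using the bert-score and rouge-score packages for summarization, exact-match accuracy for the benchmarks, and a naive biased-response frequency standing in for the Bias Index (the authors' exact BI formula isn't reproduced here).

```python
# A minimal sketch of the three evaluation angles described above.
# bias_index is a naive frequency of biased labels, not the paper's exact BI formula.
from bert_score import score as bert_score
from rouge_score import rouge_scorer

def summarization_metrics(generated: list[str], references: list[str]) -> dict[str, float]:
    """BERTScore F1 and ROUGE-L F1 for generated summaries vs. reference summaries."""
    _, _, f1 = bert_score(generated, references, lang="en")
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = [scorer.score(ref, gen)["rougeL"].fmeasure
               for ref, gen in zip(references, generated)]
    return {
        "bertscore_f1": f1.mean().item(),
        "rouge_l": sum(rouge_l) / len(rouge_l),
    }

def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Exact-match accuracy for the language understanding benchmarks."""
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

def bias_index(labels: list[str]) -> float:
    """Share of responses labeled positive or negative toward a group.
    Neutral responses and refusals count as unbiased here."""
    return sum(label in {"positive", "negative"} for label in labels) / len(labels)
```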

Results

Summarization Tasks

Here are the summarization prompts that were used and the experiment results:

List of prompts used for summarization task with varying levels of politeness

graphical representation of results from summarization task on different models
Results from summarization task set

Takeaways

  • ROUGE-L and BERTScore scores stay consistent regardless of the politeness level
  • For the GPT models, as the politeness level decreases, so does output length
  • For Llama, the length tends to decrease as politeness decreases, but then surges when using extremely impolite prompts
  • One potential reason for the trend of outputs being longer at higher levels of politeness is that polite and formal language is more likely to be used in scenarios that require descriptive instructions

Language Understanding Benchmarks

Performance on these tasks was much more sensitive to prompt politeness.

Here are the prompts that were used and the results:

List of prompts used for language understanding tasks with varying levels of politeness

Table of results from the language understanding experiments
Scores on the language understanding benchmarks; we'll focus on MMLU

Results shown in heat maps for performance of different models on understanding language benchmarks
Color of tiles indicates statistically significantly better or worse performance for the politeness level on the y-axis than that on the x-axis

Takeaways

  • On average, the GPT models’ best-performing prompts were in the middle of the spectrum: not overly polite, not rude.
  • While the scores gradually decrease at lower politeness levels, the changes aren’t always significant. The most significant drop-offs happen at the lowest levels of politeness.
  • GPT-4’s scores are more stable than GPT-3.5’s (no dark tiles in the heat map). With advanced models, the politeness level of the prompt may not be as important.
  • Llama-2-70B fluctuates the most, with scores scaling proportionally to the politeness level.

Bias Detection

Let's look at the prompts used and the results:

List of prompts used for bias detection with varying levels of politeness

Graphical representation of performance from different models on bias detection
R=Race, G=Gender, N=Nationality, S=Socioeconomic status

Takeaways

  • In general, moderately polite prompts tended to minimize bias the most
  • Extremely polite or impolite prompts tended to exacerbate biases, and increased the chance that the model would refuse to respond.
  • Although Llama appears to show the lowest bias, it refused to answer questions much more often, which is its own type of bias
  • Overall, GPT-3.5’s stereotype bias is higher than GPT-4’s, which is higher than Llama’s
  • Although the model’s bias tends to be lower in cases of extreme impoliteness, this is often because the model will refuse to answer the question
  • GPT-4 is much less likely to refuse to answer a question
  • A politeness level of 6 seems to be the sweet spot for GPT-4

In general, we see high bias at both extremes. Drawing a parallel to human behavior, perhaps this is because in highly respectful and polite environments, people feel they can express their true thoughts without being concerned about moral constraints, while at the lower end, rude language can provoke a sense of offense and prejudice.

Wrapping up

The tl;dr of this paper is that you want to be in the middle: not overly polite, but not rude either. Another nuance we didn’t cover, because we focused on the results from the English experiments, is that models are sensitive to the politeness norms of the language they were trained on. If your user base spans many different cultures and languages, you should keep this in mind as you develop your prompts.

Daniel Cleary
Founder