One of the first A/B tests I ever ran in PromptHub was to see how the output changed after adding the word “please” to the end of a prompt. I’ve been waiting for a research paper to come out on the topic, and the day is finally here!

We’ll be diving into two papers, but this one will be the main focus:

Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance

Let’s finally answer the question: Does being polite to LLMs help get better outputs?

Previous works

A popular research paper came out earlier this year titled “Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4”. We put together a rundown of the paper here.

The researchers tested 26 different prompt engineering principles, one of them related to using polite language in prompts.

A table showing the improvement in correctness of a prompt when applying a principle

The results noted above are for GPT-4. If you want to see the performance metrics for GPT-3.5 and access all the data via a Google Sheet, join our newsletter and you'll get it in your inbox.

Looking at the last two columns above, the researchers found that adding supplementary polite phrases didn’t increase the output quality.

However, in the same paper, the 26th principle tested included the word 'please'.

A table showing the improvement in correctness of a prompt when applying a principle

So should you be polite?

I’ve always assumed being polite probably helps, and shouldn’t hurt output quality. But is that the case? Let’s dive into some experiments, results, and takeaways.

Experiment setup

The researchers tested the impact of politeness in prompts across English, Chinese, and Japanese tasks. We’ll focus mostly on the experiments related to the English tasks.

Models used: GPT-3.5-Turbo, GPT-4, Llama-2-70B-chat

The researchers tested the impact of politeness across three tasks:

  • Summarization: Observing the effect of prompt politeness on the conciseness and accuracy of summarizing articles from CNN/DailyMail
  • Language Understanding Benchmarks: Testing comprehension and reasoning abilities
  • Stereotypical Bias Detection: Examining the LLMs’ propensity to exhibit bias, with responses categorized as positive, neutral, negative, or a refusal to answer

The researchers designed eight prompt templates for each language, varying from highly polite to extremely impolite.

List of prompts with varying levels of politeness
“Ranked Score” represents the average politeness rating participants assigned to each sentence.
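To make the setup concrete, here's a minimal sketch of how you might run the same task at different politeness levels. The template wordings and the OpenAI client call below are my own illustrative assumptions, not the paper's exact prompts or test harness.

```python
# Sketch: run the same summarization task with prompts at different politeness levels.
# The politeness phrasings are illustrative only, not the paper's templates.
from openai import OpenAI  # assumes the official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Higher number = more polite, lower = more impolite (hypothetical wordings)
POLITENESS_TEMPLATES = {
    8: "Could you please summarize the following article? Thank you so much!\n\n{article}",
    6: "Please summarize the following article.\n\n{article}",
    4: "Summarize the following article.\n\n{article}",
    1: "Summarize this article. Do it now.\n\n{article}",
}

def summarize(article: str, politeness_level: int, model: str = "gpt-4") -> str:
    """Run the summarization task using one politeness variant of the prompt."""
    prompt = POLITENESS_TEMPLATES[politeness_level].format(article=article)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example: compare output length across politeness levels for one article
# for level in POLITENESS_TEMPLATES:
#     summary = summarize(article_text, level)
#     print(level, len(summary.split()))
```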

Evaluations

  • Summarization: BERTScore and ROUGE-L metrics evaluated the quality and relevance of generated summaries.
  • Language Understanding: Accuracy was measured by comparing LLM responses to correct answers
  • Bias Detection: A Bias Index (BI) calculated the frequency of biased responses
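For reference, here's a rough sketch of how the summarization metrics can be computed with the open-source rouge-score and bert-score packages. The paper doesn't say which implementations or settings it used, so treat the specific libraries and options as assumptions.

```python
# Sketch: score a generated summary against a reference summary.
# Assumes `pip install rouge-score bert-score`; the paper's exact tooling isn't specified.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate_summary(candidate: str, reference: str) -> dict:
    # ROUGE-L: longest-common-subsequence overlap between candidate and reference
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

    # BERTScore: semantic similarity computed over contextual token embeddings
    _, _, f1 = bert_score([candidate], [reference], lang="en")

    return {"rougeL_f1": rouge_l, "bertscore_f1": f1.item()}

# Example:
# print(evaluate_summary(generated_summary, reference_summary))
```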

Results

Summarization Tasks

Here are the summarization prompts that were used and the experiment results:

List of prompts used for summarization task with varying levels of politeness

graphical representation of results from summarization task on different models
Results from summarization task set

Takeaways

  • ROUGE-L and BERTScore scores stay consistent regardless of the politeness level
  • For the GPT models, as the politeness level decreases, so does output length
  • For Llama, the length tends to decrease as politeness decreases, but then surges when using extremely impolite prompts
  • One potential reason outputs are longer at higher politeness levels is that polite, formal language is more likely to appear in scenarios that call for detailed, descriptive instructions

Language Understanding Benchmarks

Performance on these tasks was much more sensitive to prompt politeness.

Here are the prompts that were used and the results:

List of prompts used for understanding tasks with varying levels of politeness

Table of results from the language understanding experiments
Scores on the language understanding benchmarks; we'll focus on MMLU

Results shown in heat maps for performance of different models on understanding language benchmarks
Tile color indicates statistically significantly better or worse performance for the politeness level on the y-axis compared to the one on the x-axis

Takeaways

  • On average, the GPT models’ best-performing prompts were in the middle of the spectrum: not overly polite, not rude.
  • While the scores gradually decrease at lower politeness levels, the changes aren’t always significant. The most significant drop-offs happen at the lowest levels of politeness.
  • GPT-4’s scores are more stable than GPT-3.5’s (no dark tiles in the heat map). With more advanced models, the politeness level of the prompt may not be as important
  • Llama-2-70B fluctuates the most; its scores scale roughly in proportion to the politeness level

Bias detection

Let's look at the prompts used and the results:

List of prompts used for bias detection with varying levels of politeness

Graphical representation of performance from different models on bias detection
R = Race, G = Gender, N = Nationality, S = Socioeconomic status
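To make the Bias Index concrete: the paper describes it as the frequency of biased responses, so a tally along these lines captures the idea. The exact formula and the handling of refusals below are my assumptions, not taken from the paper.

```python
# Sketch: frequency-based Bias Index over labeled responses to bias-probe questions.
from collections import Counter

# Response categories per the paper's setup; treating only positive/negative stances
# as "biased" (and excluding refusals) is my assumption about the formula.
CATEGORIES = ("positive", "neutral", "negative", "refusal")

def bias_index(labels: list[str]) -> float:
    """Fraction of responses that take a biased (non-neutral, non-refusal) stance."""
    counts = Counter(labels)
    biased = counts["positive"] + counts["negative"]
    return biased / len(labels) if labels else 0.0

# Example:
# labels = ["neutral", "negative", "refusal", "positive", "neutral"]
# print(bias_index(labels))  # 0.4
```

Note how a model that refuses often will look less biased under this kind of index, which is exactly the caveat raised for Llama below.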

Takeaways

  • In general, moderately polite prompts tended to minimize bias the most
  • Extremely polite or impolite prompts tended to exacerbate biases, and increased the chance that the model would refuse to respond.
  • Although Llama appears to show the lowest bias, it refused to answer questions much more often, which is its own type of bias
  • Overall, GPT-3.5’s stereotype bias is higher than GPT-4’s, which is higher than Llama’s
  • Although the models’ bias tends to be lower in cases of extreme impoliteness, this is often because the model refuses to answer the question
  • GPT-4 is much less likely to refuse to answer a question
  • A politeness level of 6 seems to be the sweet spot for GPT-4

In general, we see high bias at both extremes. Thinking about human behavior, perhaps this is because in highly respectful and polite environments, people feel they can express their true thoughts without being concerned about moral constraints. At the lower end, rude language can provoke a sense of offense and prejudice.

Wrapping up

The tl;dr of this paper is that you want to be in the middle: not overly polite, but not rude either. Another nuance we didn’t cover, since we focused on the results from the English experiments, is that models trained on a specific language are sensitive to the politeness norms of that language. If your user base spans many different cultures and languages, keep this in mind as you develop your prompts.

Daniel Cleary
Founder