Hallucinations are a big problem in the AI space. If you’re using any generative AI tools (ChatGPT included), the likelihood of receiving incorrect information is high.

Our friends over at Synthminds shared a study from Johns Hopkins University that introduces a prompting method that can help reduce hallucinations. (You can check it out here.)

The study introduced a method called “According to…” prompting. It directs LLMs to ground their responses in data from their pre-training corpus, and was inspired by journalists' practice of quoting information “according to sources”.

It involves adding some text to a prompt that instructs the model to source information from a specific (and trusted) source, like Wikipedia.

Graphic displaying messages between human and AI, using the according to method
After adding the "According to" phrase, the output becomes more grounded in factual data

The core of “According to...” prompting

Adding the “According to” phrase to your prompt increases the probability that LLMs will ground their response in data they’ve been trained on, rather than making things up (i.e. hallucinating).
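To make this concrete, here's a minimal sketch (in Python) of how you might tack a grounding phrase onto a question before sending it to a model. The helper name is made up, and the phrase paraphrases the study's Wikipedia example:

```python
# Minimal sketch of "According to..." prompting.
# `build_grounded_prompt` is a hypothetical helper, not something from the paper.

GROUNDING_PHRASE = (
    "Respond to this question using only information that can be "
    "attributed to Wikipedia."
)

def build_grounded_prompt(question: str, grounding_phrase: str = GROUNDING_PHRASE) -> str:
    """Append a grounding instruction to a plain question."""
    return f"{question} {grounding_phrase}"

if __name__ == "__main__":
    print(build_grounded_prompt("In what year was the Eiffel Tower completed?"))
    # -> In what year was the Eiffel Tower completed? Respond to this question
    #    using only information that can be attributed to Wikipedia.
```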

This method goes beyond referencing just one source like Wikipedia. It can pull from anything that is known to be in the pre-training data. I've listed a few examples below, with a short code sketch after the list.

Grounding Examples

Law:

  • "Based on the rulings in Harvard Law Review..."

Medicine:

  • "According to the World Health Organization's latest report..."

Entertainment:

  • "Based on the reviews in Rotten Tomatoes..."

Finance:

  • "As per the insights from Bloomberg's market data..."

Technology:

  • "As highlighted in the latest issue of MIT Technology Review..."

Education:

  • "Based on the curriculum guidelines from the Department of Education..."

Environment:

  • "Based on the data from the Environmental Protection Agency (EPA)..."

We put together a simple template so that you can try this method out in PromptHub (link here).

If you don't have PromptHub access but want to try it out, reply to the email that gets sent when you join the waitlist and I'll share an access code with you.

The "according to" template in the PromptHub platform

Experiments: Setup

The researchers ran experiments across a variety of open- and closed-source models.

They used open-source datasets (Natural Questions, TriviaQA, HotpotQA, and ELI5) to gather questions and tasks for the experiment.

Experiment design

The main goal was to evaluate the grounding effectiveness of the “According to...” method: when these phrases are added to a prompt, does the output better reflect the exact data in the pre-training corpus?

For each dataset, the model was presented with a question or task.

Measurement with QUIP-Score

To check whether the model's responses were genuinely rooted in its pre-training data, the researchers used a tool called Data Portraits. This tool allowed them to quickly determine whether the model’s output was directly pulled from its training data.

Data Portraits works by indexing a large corpus (like Wikipedia) and then performing fast lookups to see whether a particular sequence of words in the model's output appears in the indexed data.
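The real Data Portraits tool uses a Bloom filter over hashed snippets of the corpus; the sketch below is a much simpler exact-set stand-in that only illustrates the membership-testing idea (the function names and the 25-character n-gram length are assumptions for illustration):

```python
# Simplified stand-in for a Data Portraits-style membership index.
# The real tool uses a Bloom filter over a huge corpus; this exact-set version
# only illustrates the "was this snippet seen in the corpus?" check.

def char_ngrams(text: str, n: int = 25) -> list[str]:
    """Sliding character n-grams of the text (n=25 is an illustrative choice)."""
    text = " ".join(text.lower().split())  # normalize whitespace and case
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_index(corpus_docs: list[str], n: int = 25) -> set[str]:
    """Index every character n-gram that appears anywhere in the corpus."""
    index: set[str] = set()
    for doc in corpus_docs:
        index.update(char_ngrams(doc, n))
    return index

def seen_in_corpus(snippet: str, index: set[str], n: int = 25) -> bool:
    """True if every n-gram of the snippet appears in the indexed corpus."""
    grams = char_ngrams(snippet, n)
    return bool(grams) and all(g in index for g in grams)
```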

To quantify how grounded the model’s output was, the researchers created a metric called the QUIP-Score (Quoted Information Precision). A higher score means a larger portion of the output overlaps with the training data.
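Given an index like the one sketched above, a toy version of the metric is just the fraction of the output's n-grams that are found in the corpus. The paper's exact normalization may differ, so treat this as an illustration of the idea rather than the official implementation:

```python
def quip_score(output: str, index: set[str], n: int = 25) -> float:
    """Toy QUIP-Score: fraction of the output's character n-grams found in the
    corpus index (e.g. one built with `build_index` from the previous sketch)."""
    text = " ".join(output.lower().split())
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    return sum(g in index for g in grams) / len(grams)

# Example: an answer that quotes the corpus almost verbatim scores near 1.0.
corpus = ["The Eiffel Tower was completed in 1889 and is located in Paris, France."]
index = build_index(corpus)  # from the previous sketch
print(quip_score("The Eiffel Tower was completed in 1889.", index))
```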

Experiments: Results

Overview

Table showing the performance differences between the null prompt, grounded prompts, and anti-grounded prompts
Performance of the null, grounded, and anti-grounded prompts

A few notes about the table above:

  • The top section is the null prompt (no additional prompt other than the question or task)
  • The middle section includes the grounding prompts
  • The last section includes the anti-grounding prompts
  • Colored cells indicate gains (green), losses (red), or the same (gray)

Grounding effectiveness compared to traditional prompting

A table comparing how different models performed with grounded prompts vs the null prompt
Percentage improvements when using the grounding prompt compared to the null prompt across models

Outputs generated using the "According to..." method were consistently more grounded in factual data from the training corpus and achieved higher QUIP-Scores compared to traditional prompting (usually by 5-15%).

While the primary focus was on grounding, the researchers noted that the “According to…” method sometimes also improved the quality of the responses.

One very important thing to note: a high grounding (QUIP) score doesn’t always mean the answer is correct.

Anti-grounding prompts

Anti-grounding prompts, which either discouraged grounding or instructed the model to anchor its answers in other corpora, typically led to diminished grounding to the pre-trained data and lower QUIP-Scores.

Additionally, in tasks that relied heavily on Wikipedia content, anti-grounding prompts also led to a decrease in task performance.

Impact of model size

The experiments spanned models of various sizes. As model size increased, so did the model's ability to effectively ground its responses.

2 graphs demonstrating results from the experiment, based on model size
The larger the model, the better its ability to effectively ground its responses

Impact of frequency

The study showed that text that is frequently present in the training data is more likely to be accurately referenced in the model's output.

2 bar graphs showing how the frequency of an entity in training data related to its QUIP-Score
The more frequently a piece of information appeared in the training data, the more easily it could be grounded

Example prompts and responses

Table of prompts and outputs from the null prompt and grounded prompts
Outputs from null prompts and grounded prompts

Incorporating "According to..." in fine-tuned models

A challenge when using the “According to…” method with closed models (like OpenAI's) is that you can’t be 100% sure what is in the training data. But if you fine-tune your own model, you’ll know in much greater detail. Grounding responses in reliable pre-training data can bolster the accuracy and reliability of these specialized models.

An important thing to keep in mind is one of the fundamental rules of prompt engineering: “give the model room to think.” You don’t want grounding to overly restrict the model’s creative or problem-solving capabilities. For example, if you ground a legal query to an entertainment corpus, you’ll likely end up with incorrect results.

Wrapping up

The “According to…” method is another tool you can add to your prompt engineering toolbelt.

If you’re interested in other prompt engineering methods, I’d recommend checking out our other articles:

Happy prompting!

Dan Cleary
Founder