Hallucinations are a big problem in the AI space. If you’re using any generative AI tool (ChatGPT included), there's a real chance you'll receive incorrect information.
The study introduced a method called “According to…” prompting, which directs LLMs to ground their responses in data from their pre-training set. The method was inspired by the journalistic practice of attributing information “according to sources”.
It involves adding some text to a prompt that instructs the model to source information from a specific (and trusted) source, like Wikipedia.
The core of “According to...” prompting
Adding the “According to” phrase to your prompt increases the probability that LLMs will ground their response in data they’ve been trained on, rather than making things up (i.e. hallucinating).
This method goes beyond referencing just one source like Wikipedia. It can pull from anything that is known to be in the pre-training data. I've listed a few examples below.
- "Based on the rulings in Harvard Law Review..."
- "According to the World Health Organization's latest report..."
- "Based on the reviews in Rotten Tomatoes..."
- "As per the insights from Bloomberg's market data..."
- "As highlighted in the latest issue of MIT Technology Review..."
- "Based on the curriculum guidelines from the Department of Education..."
- "Based on the data from the Environmental Protection Agency (EPA)..."
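In practice, applying the method is just string construction: you attach a grounding phrase to your question before sending it to the model. Here's a minimal sketch; `build_grounded_prompt` is a hypothetical helper name, and the default phrase is just one of the examples above.

```python
# Minimal sketch of "According to..." prompting: prepend a grounding
# phrase that points the model at a trusted source before the question.
# The helper name and default phrase are illustrative, not from the paper.

def build_grounded_prompt(question: str, phrase: str = "According to Wikipedia,") -> str:
    """Attach a grounding phrase so the model anchors its answer in a known corpus."""
    return f"{phrase} {question}"

prompt = build_grounded_prompt("who designed the Eiffel Tower?")
print(prompt)
# You would then send `prompt` to your LLM of choice as usual.
```

Swapping the `phrase` argument lets you target any of the sources listed above (the WHO, Bloomberg, the EPA, and so on) without changing the rest of your pipeline.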
We put together a simple template so that you can try this method out in PromptHub (link here).
If you don't have PromptHub access but want to try it out, reply to the email that gets sent when you join the waitlist and I'll share an access code with you.
The researchers ran experiments across a variety of open and closed source models.
They used open-source datasets (Natural Questions, TriviaQA, HotpotQA, and ELI5) to gather questions and tasks for the experiment.
The main goal was to measure the grounding effectiveness of the “According to...” method. In other words: when we add these types of phrases to prompts, does the output better reflect the exact data from the pre-training set?
For each dataset, the model was presented with a question or task.
Measurement with QUIP-Score
To ensure that the model's responses were genuinely rooted in its pre-training data, the researchers used a tool called Data Portraits. This tool allowed them to quickly determine if the model’s output was directly pulled from its training data.
Data Portraits works by indexing a large corpus (like Wikipedia) and then performing fast lookups to see if a particular sequence of words in the model's output was present in the indexed data.
To quantify the grounding of the model’s output, the researchers created a metric called the QUIP-Score (Quoted Information Precision). A higher score means a larger portion of the output overlaps with the model's training data.
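The idea behind the metric can be sketched in a few lines: index the corpus's character n-grams, then score an output by the fraction of its n-grams found in that index. This is a toy, in-memory illustration; the real Data Portraits tool uses a compact sketch structure to handle corpora the size of Wikipedia, and the paper's exact n-gram settings may differ.

```python
# Toy sketch of the QUIP-Score idea: score an output by the fraction of
# its character n-grams that appear in an indexed corpus. Illustrative
# only -- the real Data Portraits index is far more memory-efficient.

def char_ngrams(text: str, n: int = 8) -> list[str]:
    """All overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_index(corpus: str, n: int = 8) -> set[str]:
    """Index every character n-gram in the corpus for fast membership lookups."""
    return set(char_ngrams(corpus, n))

def quip_score(output: str, index: set[str], n: int = 8) -> float:
    """Fraction of the output's n-grams present in the indexed corpus."""
    grams = char_ngrams(output, n)
    if not grams:
        return 0.0
    return sum(g in index for g in grams) / len(grams)

corpus = "The Eiffel Tower is a wrought-iron lattice tower in Paris."
index = build_index(corpus)

# Text quoted from the corpus scores high; fabricated text scores near zero.
print(quip_score("The Eiffel Tower is a wrought-iron lattice tower.", index))
print(quip_score("The tower was built on the moon in 1999.", index))
```

Note that this is a precision-style measure of overlap with the corpus, which is exactly why (as discussed below) a high score doesn't by itself guarantee the answer is correct.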
A few notes about the graph above
- The top section is the null prompt (no additional prompt other than the question or task)
- The middle section includes the grounding prompts
- The last section includes the anti-grounding prompts
- Colored cells indicate gains (green), losses (red), or the same (gray)
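To make the three conditions concrete, here's what the prompt variants might look like side by side. The wording below is paraphrased for illustration; it is not quoted from the paper's exact prompt set.

```python
# Illustrative examples of the three prompt conditions compared in the
# experiment. Wording is paraphrased, not quoted from the paper.

QUESTION = "What is the capital of Australia?"

PROMPT_VARIANTS = {
    # Null: just the question, no additional instruction.
    "null": QUESTION,
    # Grounding: steer the model toward its pre-training data.
    "grounding": f"{QUESTION} Respond using only information that can be attributed to Wikipedia.",
    # Anti-grounding: steer the model AWAY from the target corpus.
    "anti_grounding": f"{QUESTION} Respond without using any information from Wikipedia.",
}

for name, prompt in PROMPT_VARIANTS.items():
    print(f"{name}: {prompt}")
```

Running the same question through all three variants and comparing QUIP-Scores is essentially what the rows of the results table capture.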
Grounding effectiveness compared to traditional prompting
Outputs generated using the "According to..." method were consistently more grounded in factual data from the training corpus and achieved higher QUIP-Scores compared to traditional prompting (usually by 5-15%).
While the primary focus was on grounding, the researchers noted that the “According to…” method sometimes also improved the quality of the responses.
One very important thing to note is that a high grounding (QUIP) score doesn’t guarantee a correct answer; it only measures overlap with the training data.
Anti-grounding prompts, which either discouraged grounding or instructed the model to anchor its answers in other corpora, typically led to diminished grounding to the pre-trained data and lower QUIP-Scores.
Additionally, in tasks that relied heavily on Wikipedia content, anti-grounding prompts also led to a decrease in end-task performance.
Impact of model size
The experiments spanned models of various sizes. As model size increased, so did the model's ability to effectively ground its responses.
Impact of frequency
The study showed that text that is frequently present in the training data is more likely to be accurately referenced in the model's output.
Example prompts and responses
Incorporating "According to..." in fine-tuned models
A challenge with using the “According to…” method on closed models (like OpenAI's) is that you can’t be 100% sure what's in the training data. But if you fine-tune your own model, you’ll know in greater detail. Grounding responses in reliable pre-training data can bolster the accuracy and reliability of these specialized models.
An important thing to keep in mind is one of the fundamental rules of prompt engineering, “give the model room to think”. You don’t want grounding to overly restrict the model’s creative or problem-solving capabilities. For example, if you attempt to ground a query about law to an entertainment corpus, you might end up with incorrect results.
The “According to…” method is another tool you can add to your prompt engineering toolbelt.
If you’re interested in other prompt engineering methods I would recommend checking out our other articles: