Enhancing AI Accuracy: Decreasing Hallucinations with CoVe

Being confidently incorrect is currently one of the biggest problems with LLMs. Whether you’re using ChatGPT or building AI features into your product, hallucinations are a huge issue. They can cost you your users' trust, create reputational risk, and more.

We’ve covered a few prompt engineering methods that reduce hallucinations, and we’ve got another to add today. Introducing Chain of Verification (CoVe), from the research team at Meta.

Figure: CoVe flow (flowchart of prompts and outputs for Chain of Verification prompting)

What is Chain of Verification

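CoVe is a prompt engineering method that reduces hallucinations with a verification loop: the model drafts an answer, writes questions that fact-check that draft, answers those questions, and then revises the draft accordingly.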

How Chain of Verification works

CoVe is a four-step process.

1. Generate Initial Response

Process: Given a prompt, the model generates a response as it normally would.
Example: Let's say the question is “Which US presidents were born in New York?” The model might respond with, “Here are some presidents that were born in New York: Donald Trump, Franklin D. Roosevelt…”

2. Generate Verifications

Process: Based on the initial question and response, the model is prompted to generate a series of verification questions to self-analyze answers for mistakes.
Example:
-“Where was Donald Trump born?”
-“Where was FDR born?”

3. Execute Verifications

Process: Answer each verification question and compare the answer against the initial response. There are a few different methods for this step:
Joint Method: Combines the planning and executing of all steps into one prompt. Risks repeating hallucinations if present in the initial response.
2-Step Method: Separates planning and execution of the verification questions into different prompts, reducing the risk of bias from initial response.
Factored Method: Answers each verification question independently (separate prompts for each). Eliminates interference from initial response or other verification questions/answers.
Factor + Revise Method: Independently verifies each verification answer, then revises the original answer to rectify any inconsistencies. This method enhances the accuracy of the verification question answer pairs by separating fact-checking from response refinement.
Example:
-“Donald Trump was born in Queens, New York City, New York, United States”
-“Franklin D. Roosevelt was born in Hyde Park, New York…”
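
To make the difference between the execution methods concrete, here is a minimal sketch of what each one puts into the prompt. The function names and prompt wording are our own illustrations, not the prompts used by the researchers. The point to notice is what each prompt leaves out: the 2-Step prompt drops the initial response, and the Factored prompts also drop the other verification questions.

```python
from typing import List


def joint_prompt(question: str) -> str:
    """Joint: one prompt asks for the draft, the checks, and the final answer."""
    return (
        f"Question: {question}\n"
        "First answer the question. Then list verification questions for "
        "your answer, answer each one, and give a corrected final answer."
    )


def two_step_execution_prompt(verification_questions: List[str]) -> str:
    """2-Step: all questions share one prompt, but the draft answer is absent."""
    numbered = "\n".join(
        f"{i}. {q}" for i, q in enumerate(verification_questions, start=1)
    )
    return f"Answer each question concisely:\n{numbered}"


def factored_execution_prompts(verification_questions: List[str]) -> List[str]:
    """Factored: one prompt per question, with no draft and no sibling questions."""
    return [f"Answer concisely: {q}" for q in verification_questions]
```

For the example above, `factored_execution_prompts(["Where was Donald Trump born?", "Where was FDR born?"])` returns two separate prompts, so a mistake in one answer cannot leak into the other.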

4. Generate Final Answer

Process: The final answer is generated using a few-shot prompt. It takes into account the baseline response and verification question answer pairs, and makes any corrections.
Example Final Response: “Here are some presidents who were born in NY…”
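
The four steps can be wired together in a few lines. The sketch below uses the Factored Method and assumes a hypothetical `llm` callable that takes a prompt string and returns text. Swap in whatever client you use. The prompt wording is ours, and for brevity the final step uses a plain instruction rather than the few-shot prompt described above. The scripted stub at the bottom returns canned replies so the flow can be run without any API.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def chain_of_verification(question: str, llm: LLM) -> str:
    # Step 1: baseline response.
    baseline = llm(f"Answer the question.\nQuestion: {question}")

    # Step 2: plan verification questions, one per line.
    plan = llm(
        "Write one fact-checking question per line for the claims in this answer.\n"
        f"Question: {question}\nAnswer: {baseline}"
    )
    verification_questions = [
        line.strip("- ").strip() for line in plan.splitlines() if line.strip()
    ]

    # Step 3: execute verifications (factored): each question gets its own
    # prompt, which contains neither the baseline nor the other questions.
    qa_pairs = [(q, llm(f"Answer concisely: {q}")) for q in verification_questions]

    # Step 4: final answer, conditioned on the baseline and the Q/A pairs.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return llm(
        "Revise the draft answer so it agrees with the verified facts.\n"
        f"Question: {question}\nDraft: {baseline}\nVerified facts:\n{evidence}"
    )


def make_scripted_llm(replies: List[str]) -> Tuple[LLM, List[str]]:
    """Stand-in for a real model: returns canned replies and records prompts."""
    calls: List[str] = []
    remaining = iter(replies)

    def llm(prompt: str) -> str:
        calls.append(prompt)
        return next(remaining)

    return llm, calls


if __name__ == "__main__":
    # Canned replies for illustration only, not real model output.
    llm, calls = make_scripted_llm([
        "Donald Trump, Franklin D. Roosevelt",
        "- Where was Donald Trump born?\n- Where was FDR born?",
        "Queens, New York City",
        "Hyde Park, New York",
        "Donald Trump (Queens) and Franklin D. Roosevelt (Hyde Park)",
    ])
    print(chain_of_verification("Which US presidents were born in New York?", llm))
    print(f"{len(calls)} model calls")
```

Note the cost: with N verification questions, the factored flow makes N + 3 model calls, which is the price of keeping each check isolated.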

Chain of Verification prompt template

We put together a single-shot prompt using the Joint Method so you can try out CoVe without having to do anything complex.

As noted above, and again in the experiment analysis below, the Joint Method is the least effective version of CoVe. But it is still worth trying, as it should outperform basic prompting.

If you don't have PromptHub access but want to try it out, reply to the email that gets sent when you join the waitlist and I'll share an access code with you.

Experiments setup

The research team evaluated the CoVe method using a variety of datasets, models, and baseline methods. The goal was to see:

Can CoVe reduce the rate of hallucinatory content?
Can CoVe remove hallucinations without reducing correct content?

Datasets

Wikidata List-Based Questions: Tasks focusing on generating lists with accurate items. “Who are some [Profession]s who were born in [City]?”
Closed-Book MultiSpanQA: Questions requiring multiple, independent answers. Example question: “Who invented the first printing press and in what year?” Example answer: “Johannes Gutenberg, 1450”.
Longform Text Generation: Creation of long, coherent text passages. “Tell me a bio of <entity>”.

Models

Llama 65B
Llama 2 70B Chat
InstructGPT
ChatGPT
PerplexityAI

Baseline Methods

CoVe's performance was benchmarked against several existing methods:

Standard prompting
Instruction-tuned models
Chain-of-Thought (CoT) Prompting

Experiment Results

Wikidata

Precision more than doubled for Llama 65B few-shot (from .17 to .36)
The number of hallucinated answers per query decreased greatly (from 2.95 to .68), while non-hallucinated answers per query dropped only modestly (from .59 to .38).
CoT generated the highest number of hallucinations per query, by a wide margin.
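
The precision figures and the per-query counts above line up with each other. As a quick check, assuming precision here means correct answers divided by all answers given, the sketch below recomputes it from the average counts (`list_precision` is our own helper name):

```python
def list_precision(correct_per_query: float, hallucinated_per_query: float) -> float:
    """Share of listed answers that are correct (assumed definition)."""
    return correct_per_query / (correct_per_query + hallucinated_per_query)


baseline = list_precision(correct_per_query=0.59, hallucinated_per_query=2.95)
cove = list_precision(correct_per_query=0.38, hallucinated_per_query=0.68)

print(round(baseline, 2), round(cove, 2))  # 0.17 0.36
```

In other words, CoVe's answers got shorter, but most of what was cut was wrong.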

MultiSpanQA

23% increase in F1 score (from .39 to .48) over the few-shot baseline
F1 is a combined metric that balances precision and recall, providing a single score to measure a model's accuracy in classification tasks.
CoT again had the lowest measure of accuracy
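
For readers who haven't worked with F1 before, here is the standard formula (the harmonic mean of precision and recall) along with the arithmetic behind the 23% figure. The helper names are ours, and the precision and recall values in the last line are made up purely to show how the formula behaves.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after`."""
    return (after - before) / before


# The improvement quoted above: F1 going from .39 to .48.
print(round(relative_change(0.39, 0.48) * 100))  # 23

# Made-up inputs: F1 lands closer to the weaker of the two values.
print(f1_score(precision=0.5, recall=0.25))
```

Because it is a harmonic mean, F1 punishes a model that is strong on only one of precision or recall.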

Longform

Precision saw a 28% increase over the few-shot baseline (from 55.9 to 71.4)
However, the number of average facts provided decreased from 16.6 to 12.3
Llama 65B CoVe (factored and factor+revise) outperformed ChatGPT and PerplexityAI in longform generation. This is notable as PerplexityAI utilizes retrieval augmentation (internet search), while CoVe relies solely on the base LLM.
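
One way to read the precision gain against the drop in facts is to estimate how many supported and unsupported facts each answer contains. This is a back-of-the-envelope sketch, not a number from the paper: it assumes the precision figure is the share of stated facts that are supported, and it multiplies two averages together, which is only an approximation.

```python
from typing import Tuple


def estimated_fact_counts(precision_pct: float, facts_per_answer: float) -> Tuple[float, float]:
    """Rough (supported, unsupported) facts per answer from two averages."""
    supported = precision_pct / 100 * facts_per_answer
    return supported, facts_per_answer - supported


few_shot = estimated_fact_counts(precision_pct=55.9, facts_per_answer=16.6)
with_cove = estimated_fact_counts(precision_pct=71.4, facts_per_answer=12.3)

for name, (supported, unsupported) in [("few-shot", few_shot), ("CoVe", with_cove)]:
    print(f"{name}: ~{supported:.1f} supported, ~{unsupported:.1f} unsupported")
```

Under those assumptions, the estimate is about 9.3 supported and 7.3 unsupported facts per answer for the few-shot baseline, versus about 8.8 and 3.5 with CoVe. That would mean most of the removed facts were unsupported ones.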

Overall

Comparison with Other Methods: Pre-trained Llama models with few-shot examples outperformed instruction-tuned models and Chain-of-Thought prompting across tasks.
Factored and 2-Step vs. Joint Methods: Factored and 2-step CoVe methods proved more effective than the Joint method. When possible, verification questions should be separate from the baseline response to avoid repetition of hallucinations.

Wrapping up

CoVe is a valuable method to include in your arsenal against hallucinations. While its implementation might be a bit intricate, it stands out by giving LLMs room to think.

CoVe builds on the same ideas that underpin other effective prompt engineering methods like Chain of Thought, Tree of Thoughts, and more. When LLMs are given the opportunity to thoroughly analyze and verify their responses, hallucinations go down.