If you frequent this blog, you know that we really like prompt engineering methods that increase performance while being easy to implement. That is why this recent DeepMind study stood out to us: Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models.

The study introduced a new prompting method called Step-Back Prompting, which showed improvements of up to 36% over chain-of-thought (CoT) prompting.

A flow chart comparing chain-of-thought prompting to step-back prompting

What is Step-Back Prompting?

Step-Back Prompting draws inspiration from the human tendency to pause and reflect when first faced with a challenging task or question. We look for higher-level concepts or principles to guide our thinking. For example, if tasked with figuring out the length of a side of a triangle, we may first recall the Pythagorean theorem.

Step-Back Prompting is motivated by the observation that many tasks that we assign to LLMs are full of implicit and explicit details. LLMs can have a hard time retrieving relevant facts when tackling these types of tasks.

How Step-Back Prompting works

Step-Back Prompting involves adding just one additional prompt, giving the model room to do some abstract thinking before it addresses the primary question.

Step-Back Prompting breaks down into two steps:

  • Abstraction: Rather than addressing the question head-on, first prompt the LLM to ask a more generic question about a higher-level concept that is still related to the main question.
  • Reasoning: Using that step-back question and its answer as a grounding mechanism, the LLM can now reason more accurately about a solution to the main question.

For example, if the main question is 'What specific steps should I take to reduce my energy consumption at home?', the step-back question might be 'What are the general principles of energy conservation?'. Or, instead of diving straight into 'How do I fix the error in this specific line of code?', a step-back question might be 'What are the common causes of this type of error?'.
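To make this concrete, here's a minimal sketch of the two-step flow in Python. It uses the OpenAI Python SDK purely as an illustrative client; the model name, the prompt wording, and the `ask`/`step_back_answer` helpers are our own assumptions, not something prescribed by the paper.

```python
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in your environment
MODEL = "gpt-4o"    # illustrative model choice, not the one used in the study


def ask(prompt: str) -> str:
    """Send a single prompt and return the model's text response."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def step_back_answer(question: str) -> str:
    # Step 1 - Abstraction: ask a more generic question about the underlying concept.
    step_back_question = ask(
        "Given the question below, write a more generic 'step-back' question "
        "about the underlying concept or principle.\n\n"
        f"Question: {question}"
    )
    principles = ask(step_back_question)

    # Step 2 - Reasoning: answer the original question, grounded in the
    # higher-level principles surfaced in step 1.
    return ask(
        f"Background principles:\n{principles}\n\n"
        f"Using the principles above, answer the original question: {question}"
    )


print(step_back_answer(
    "What specific steps should I take to reduce my energy consumption at home?"
))
```

In practice you would likely tune the abstraction prompt to your domain; the paper also uses a handful of few-shot exemplars to show the model what a good step-back question looks like.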

A real-world example

Before diving into the experiment results, here's a quick example.

Let's say we want to know how many U.S. presidents were born in the United States.

We'll compare direct prompting and Step-Back Prompting side-by-side using PromptHub's testing tools.

Here are the prompts:


Two prompts side-by-side in the PromptHub interface

Here are the outputs:

Two outputs side-by-side in the PromptHub interface

The proof is in the pudding: direct prompting misses Franklin Roosevelt. It goes to show how a little prompt engineering can go a long way toward better, more accurate results.

Want to try it out for yourself? Here's a single-shot template in PromptHub you can try.

If you don't have PromptHub access but want to try it out, reply to the email that gets sent when you join the waitlist and I'll share an access code with you.

A screenshot of the Step-Back prompt as a template in PromptHub

Experiment Setup

The researchers tested Step-Back Prompting across 3 datasets, 2 models, and various prompting methods (few-shot, CoT, direct prompting, and more).

Datasets

  • STEM (Science, Technology, Engineering, and Mathematics): Tasks that required analytical thinking and precision.
  • Knowledge QA (Question Answering): Scenarios where the model had to retrieve and provide accurate information.
  • Multi-Hop Reasoning: Complex questions necessitating the connection of multiple pieces of information to deduce the correct answer.

Models

  • PaLM-2L
  • GPT-4

Baseline methods

Step-Back Prompting was measured against a few prompting methods:

  • Direct prompting
  • Chain of Thought (CoT) Prompting
  • Take a Deep Breath (TDB) Prompting
  • Retrieval-Augmented Generation (RAG)

Experiment Results

STEM Tasks:

Table of results from STEM experiments

Takeaways:

  • Step-Back Prompting drastically improves PaLM-2L's responses
  • PaLM-2L with Step-Back Prompting outperformed both CoT prompting and GPT-4
  • It would be interesting to see the accuracy of GPT-4 + Step-Back!

Knowledge QA:

Table of results from Knowledge QA experiments

Takeaways:

  • Step-Back Prompting performs well, especially on hard questions (see the "TQA Hard" column)
  • GPT-4 outperformed Step-Back and Step-Back + RAG on the SituatedQA test set
  • Step-Back + Retrieval-Augmented Generation (RAG) produced even better results than Step-Back alone, highlighting the value of combining prompt engineering methods (a rough sketch of this combination follows this list)
  • Given how easy Step-Back Prompting is to integrate, it is almost surprising how much better the results are compared to direct prompting
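For illustration, here's a rough sketch of how Step-Back and RAG might be combined, reusing the `ask` helper from the earlier sketch. The `retrieve` function is a hypothetical stand-in for whatever search or vector-store lookup you already have, and the exact way the retrieved passages are stitched into the final prompt is our assumption, not the paper's exact recipe.

```python
from typing import List


def retrieve(query: str) -> List[str]:
    """Hypothetical retrieval helper - swap in your own search or
    vector-store lookup. Returns a list of relevant text passages."""
    return []  # placeholder


def step_back_with_rag(question: str) -> str:
    # Abstraction: derive a higher-level step-back question.
    step_back_question = ask(
        "Write a more generic 'step-back' question about the underlying "
        f"concept behind this question:\n\n{question}"
    )

    # Retrieval: fetch passages for both the original and the step-back question.
    passages = retrieve(question) + retrieve(step_back_question)
    context = "\n".join(passages)

    # Reasoning: answer the original question, grounded in the retrieved context.
    return ask(
        f"Context:\n{context}\n\n"
        f"Using the context above, answer the question: {question}"
    )
```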

Multi-Hop Reasoning:

Table of results from Multi-Hop Reasoning experiments

Takeaways:

  • Baseline performance of PaLM-2L and GPT-4 is low on MuSiQue because it requires multiple reasoning steps
  • Step-Back Prompting outperforms GPT-4 on both datasets

Wrapping up: A step forward with Step-Back Prompting

When going through the latest prompt engineering research, we always look for methods that are both easy to implement and effective. That's why we love this one, along with other prompt engineering methods like "According to" and EmotionPrompt.

The effectiveness of this method lies in its simplicity. Hopefully it helps you get better, more reliable outputs from LLMs!

Dan Cleary
Founder