Chain-of-Thought (CoT) is a prompt engineering method that guides Large Language Models (LLMs) to produce intermediate reasoning steps on the way to a final answer.

CoT is arguably the most effective prompt engineering method in practice today.

For example, a 0-shot CoT prompt might include the phrase “think step-by-step” at the end of the prompt to encourage the model to output its reasoning steps.
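For illustration, here’s a minimal sketch of a 0-shot CoT call using the OpenAI Python SDK; the model name, problem, and exact wording are placeholders rather than anything prescribed by the method:

```python
# Minimal 0-shot CoT sketch using the OpenAI Python SDK (v1-style client).
# The model name, prompt wording, and problem are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

problem = "A train travels 60 miles in 1.5 hours. What is its average speed?"

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            # Appending the trigger phrase is all 0-shot CoT requires.
            "content": f"{problem}\n\nThink step-by-step.",
        }
    ],
)

print(response.choices[0].message.content)
```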

Few-shot CoT provides examples of the reasoning process, leveraging in-context learning to help the model. A 1-shot CoT prompt would look something like this:
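Something along these lines, with the worked example being my own illustration rather than one from the paper:

```python
# Illustrative 1-shot CoT prompt: one worked example with its reasoning,
# followed by the new problem. The example content is my own, not the paper's.
one_shot_cot_prompt = """\
Q: Sarah has 3 boxes with 4 apples each. She gives away 5 apples. How many are left?
A: Let's think step by step.
   Sarah starts with 3 * 4 = 12 apples.
   She gives away 5, so 12 - 5 = 7 apples remain.
   The answer is 7.

Q: A bakery sells 6 trays of 8 muffins and then bakes 10 more muffins. How many muffins does it have?
A: Let's think step by step.
"""

print(one_shot_cot_prompt)
```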

Having CoT examples is great, but it typically requires gathering examples, labeling those examples, and maybe implementing some sort of retrieval mechanism. What if we could automate the example generation process?

That’s where a recent paper from Google and Stanford comes in: Large Language Models as Analogical Reasoners. This new method, Analogical Prompting, addresses these challenges by letting the model self-generate its own examples.

What is Analogical Prompting?

Analogical Prompting is inspired by analogical reasoning, the process we humans use to draw on relevant past experiences when we take on new problems.

Analogical Prompting enables the model to self-generate examples for problem-solving, eliminating the need for teams to manually curate and label examples.

Here’s an example and a template:

4 prompt examples laid out next to each other

Prompt template in the PromptHub platform

The idea that underpins this prompting approach is that modern LLMs already have the knowledge to generate examples for a wide range of potential problems/tasks.

Looking at the template, an important component is the emphasis on the examples being “relevant and distinct”. Diverse examples help the model avoid overfitting its response to the examples.
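To make the structure concrete, here’s a rough sketch of a self-generation prompt in the spirit of the paper’s template; the wording is paraphrased, not copied verbatim:

```python
# Analogical Prompting template, paraphrased from the paper's structure:
# the model first recalls relevant, distinct examples, then solves the problem.
ANALOGICAL_PROMPT_TEMPLATE = """\
Your task is to solve the problem below.

# Problem:
{problem}

# Instructions:
## Relevant problems:
Recall three examples of problems that are relevant to the problem above.
The problems should be distinct from each other and from the problem above.
For each example, describe the problem and explain its solution.

## Solve the initial problem:
Using the insights from the examples, solve the initial problem step by step.
"""

problem = "Find the area of a triangle with vertices (0, 0), (4, 0), and (4, 3)."
print(ANALOGICAL_PROMPT_TEMPLATE.format(problem=problem))
```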

Why three examples? This matters for any form of in-context learning! The researchers tested Analogical Prompting with a varying number of examples and found that 3-5 examples is the sweet spot. This aligns with results from other research papers as well.

A table showing the performance when generating different numbers of examples
Results from the Analogical Prompting paper

a graph showing how performance changes as you add examples in context
Source: Language Models are Few-Shot Learners

Leveraging a knowledge generation step

For some complex tasks with a wide solution space, like code generation, LLMs may overfit their solutions to the examples provided.

To mitigate this, the researchers added an additional step to the process. The step involves prompting the model to take a step back and generate high-level information before solving the task. Similar to Step-Back Prompting, this enables the model to think more broadly and abstractly before generating the examples and solving the initial problem.

For example, in a code generation task, the model might initially summarize overarching programming concepts before diving into the problem at hand. This primes the model for a more informed approach to generating examples and a final solution.

The researchers refer to this as “knowledge” and they put it into practice by adding this instruction to the template.

“# Tutorial: Identify core concepts in the problem and provide a tutorial.”

The knowledge generation step provided the biggest gains on code generation tasks. For simpler tasks, the gains were less significant. The easier the task, the less you need a knowledge generation step.

Following a similar pattern to the Step-Back Prompting method, performance increased when the knowledge generation step occurred before the examples.
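Putting that together, here’s a hedged sketch of the template with the knowledge (“tutorial”) step placed before the self-generated examples; again, the wording paraphrases the paper rather than quoting it:

```python
# Analogical Prompting with a self-generated knowledge ("tutorial") step,
# placed before the recalled examples, as the paper found works best.
# Wording is paraphrased, not copied verbatim from the paper.
ANALOGICAL_PROMPT_WITH_KNOWLEDGE = """\
Your task is to solve the competitive programming problem below.

# Problem:
{problem}

# Instructions:
## Tutorial:
Identify the core concepts in the problem and provide a short tutorial on them.

## Relevant problems:
Recall three relevant and distinct example problems. For each, describe the
problem, explain the solution, and include code where appropriate.

## Solve the initial problem:
Using the tutorial and the examples, write a full solution to the initial problem.
"""

problem = "Given an array of integers, find the length of the longest strictly increasing subsequence."
print(ANALOGICAL_PROMPT_WITH_KNOWLEDGE.format(problem=problem))
```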

A table comparing performance when knowledge is generated before versus after the examples

Generating vs. retrieving CoT examples

Generating examples has a few advantages compared to retrieving examples:

  • It gets rid of the need to create, label, or retrieve examples, speeding up the prompt engineering process
  • No need to set up a retrieval step/RAG pipeline (sketched below)
  • In some cases, generated examples may actually be better tailored to the task, because the model can lean on its entire pre-training data

But retrieving examples has benefits as well:

  • Potentially more reliable. Examples retrieved from a labeled dataset have been hand-picked and validated. Generated examples lack this guarantee.
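For comparison, here’s a minimal sketch of what the retrieval route involves: embed the incoming problem, find the closest labeled examples, and paste them into the prompt. The embeddings and dataset below are hypothetical placeholders:

```python
# Minimal sketch of retrieved few-shot CoT: pick the labeled examples whose
# embeddings are closest to the new problem. Embeddings here are random
# placeholders; in practice they would come from an embedding model.
import numpy as np

labeled_examples = [
    {"question": "Example Q1 ...", "reasoning": "Step-by-step solution 1 ...", "embedding": np.random.rand(16)},
    {"question": "Example Q2 ...", "reasoning": "Step-by-step solution 2 ...", "embedding": np.random.rand(16)},
    {"question": "Example Q3 ...", "reasoning": "Step-by-step solution 3 ...", "embedding": np.random.rand(16)},
]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_examples(problem_embedding, k=2):
    # Rank the labeled examples by similarity to the new problem and keep the top k.
    ranked = sorted(labeled_examples, key=lambda ex: cosine(problem_embedding, ex["embedding"]), reverse=True)
    return ranked[:k]

new_problem_embedding = np.random.rand(16)  # placeholder for the embedded new problem
few_shot_block = "\n\n".join(
    f"Q: {ex['question']}\nA: {ex['reasoning']}" for ex in retrieve_examples(new_problem_embedding)
)
print(few_shot_block)
```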

Experiment setup

The researchers tested Analogical Prompting across a range of reasoning-intensive tasks like solving complex math problems, generating code and more.

Datasets: GSM8K, MATH, code from Codeforces.com, BIG-Bench

Models: GPT-3.5-turbo, GPT-4, PaLM 2-L

Methods:

  • 0-shot prompting: just a normal prompt, e.g., “Solve the following math problem”
  • 0-shot CoT: “Solve the following math problem, think step by step”
  • Few-shot CoT: standard few-shot CoT, using a fixed number of reasoning examples (3 or 5, depending on the dataset)
  • Few-shot retrieved CoT: rather than using a fixed set of examples, examples are dynamically retrieved based on the problem at hand
  • Analogical Prompting (“Ours”): self-generate 3 or 5 examples, depending on the dataset
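As a rough sketch, the baseline methods differ only in what gets wrapped around the problem; the wording below is approximate, not the paper’s exact prompts:

```python
# Illustrative prompt builders for the baseline methods; the wording is
# approximate, not the paper's exact prompts.
def zero_shot(problem: str) -> str:
    return f"Solve the following math problem.\n\n{problem}"

def zero_shot_cot(problem: str) -> str:
    return f"Solve the following math problem, think step by step.\n\n{problem}"

def few_shot_cot(problem: str, examples: list[str]) -> str:
    # `examples` is a fixed, hand-labeled set of worked solutions (3 or 5 of them).
    return "\n\n".join(examples) + f"\n\nQ: {problem}\nA: Let's think step by step."

# Analogical Prompting would instead wrap the problem in the self-generation
# template sketched earlier, with no hand-written examples at all.
```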

Experiment results

GSM8K Dataset

A table showing experiment results of the different methods on the GSM8K dataset

  • Analogical Prompting outperforms all other methods
  • The improvement is more dramatic on the MATH dataset

Here is an example from the MATH dataset:

4 prompt examples side by side

  • You can see in the example above that Analogical Prompting generated a geometry example for a geometry problem
  • I don’t love the Few-shot example here. I get their point: getting relevant examples requires labeling and can lead to examples that aren’t perfectly tailored. While the injected example is math-related and pulled from the dataset, it doesn’t specifically match the problem (geometry). This could certainly occur in production applications, but it feels a little like comparing apples and oranges.

Codeforces Dataset

A table showing experiment results of the different methods on the Codeforces dataset
For complex tasks, like coding, the researchers added a step in the prompt for the model to ground itself with some knowledge

  • Analogical Prompting outperforms the baselines for both models
  • Self-generating the knowledge provides a boost in performance (10% in some cases!)
  • With Analogical Prompting + knowledge, GPT-3.5-turbo is able to get within 1% of GPT-4

An example prompt and output from the study
Prompt example with knowledge. The generated knowledge + examples are relevant and diverse

BIG-Bench Dataset

A table showing experiment results of the different methods on the BIG-Bench dataset
Results from the BIG-Bench dataset

Model Breakdown

A table showing experiment results of the different methods on the GSM8K dataset when varying model size

The researchers tested Analogical Prompting across various model sizes. The main takeaway is that the method’s performance scales with model size: as models grow larger (and are trained on more data), their ability to self-generate relevant and useful examples improves.

Error analysis

Alright, that all sounds great, but where does Analogical Prompting fall short?

Here’s a breakdown of 50 problems where Analogical Prompting fell short:

  • (10/50) Generated exemplars are irrelevant
  • (12/50) Generated exemplars are relevant but contain incorrect solutions
  • (28/50) Generated exemplars are relevant and correct, but the LLM fails to solve the new problem:
    • (12/50) A generalization gap between the exemplars and the new problem
    • (8/50) Overfitting to specific exemplars
    • (8/50) Other issues

In most cases, the generated examples were at least relevant, and often correct. The most common failure case was the LLM being unable to solve the new problem due to a generalization gap; simply put, the new problem was harder than the generated examples.

Another thing to note is that Analogical Prompting will be more expensive than typical few-shot prompting, because the examples are produced as output tokens rather than passed in as input tokens (and output tokens are typically priced higher).
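As a back-of-the-envelope illustration, with hypothetical prices where output tokens cost more than input tokens, self-generating the examples shifts those tokens to the more expensive side of the bill:

```python
# Back-of-the-envelope cost comparison. Prices and token counts are hypothetical
# placeholders, not actual provider rates.
INPUT_PRICE_PER_1K = 0.001   # hypothetical $ per 1K input tokens
OUTPUT_PRICE_PER_1K = 0.002  # hypothetical $ per 1K output tokens

example_tokens = 600   # tokens spent on the in-context examples
answer_tokens = 300    # tokens for the final reasoning + answer

# Few-shot CoT: examples are pasted into the prompt, so they're billed as input.
few_shot_cost = (example_tokens / 1000) * INPUT_PRICE_PER_1K + (answer_tokens / 1000) * OUTPUT_PRICE_PER_1K

# Analogical Prompting: the model generates the examples itself, so they're billed as output.
analogical_cost = ((example_tokens + answer_tokens) / 1000) * OUTPUT_PRICE_PER_1K

print(f"Few-shot CoT:         ${few_shot_cost:.6f} per request")
print(f"Analogical Prompting: ${analogical_cost:.6f} per request")
```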

Wrapping up

I really like this prompting method because it is easy to plug and play. No need to find relevant examples, label data, or anything like that. Additionally, as models keep getting better and smarter, this method should continue to become more effective.

Dan Cleary
Founder