
The launch of OpenAI’s o1 models and Google’s Gemini 2.0 Flash Thinking Mode has placed ‘reasoning’ models firmly at the top of AI benchmarks and leaderboards. The biggest change with these models is their ability to automatically generate Chain of Thought (CoT) reasoning steps at inference time. It has even changed how we should approach prompt engineering for these types of models (more info here: Prompt Engineering with Reasoning Models).

Traditional CoT methods, like writing out reasoning chain examples for few-shot prompting, are effective but require significant effort to craft high-quality chains. Zero-Shot CoT, with its simpler “Let’s think step by step” prompt, doesn’t always succeed in eliciting effective reasoning chains and can even degrade performance in some cases.

Over two years ago, Auto-CoT emerged as a solution to automate the generation of reasoning chains. As we discussed in our article about Chain of Thought prompting, Auto-CoT offered a way to streamline the process, but it required complex setup, including clustering, retrieval of diverse examples, and more. While innovative, its complexity limited its practical application.

More recently, a new framework called AutoReason has come onto the scene, offering a simpler yet still effective approach.

AutoReason is a 2-step framework that first generates reasoning chains for any query, and then includes the original query and reasoning chain in a second prompt for a final generation. Like OpenAI’s o1 models, AutoReason removes the need for any manual work in creating reasoning chains.

In this article, we’ll explore how AutoReason works, compare it to approaches like Auto-CoT, and share templates so you can get started testing it out right away.

Hey everyone, how's it going? Dan here, co-founder of PromptHub. Hope you're having a great holiday season. Today, we’re diving into automating Chain of Thought (CoT) reasoning and how you can apply it to any model. With recent releases like OpenAI’s o1 and Google’s Gemini 2.0 Flash Thinking Mode, reasoning is increasingly handled by the models themselves at inference time. While these capabilities are baked into some models, there’s still value in implementing reasoning frameworks yourself for tasks that require deeper reasoning.

Introduction to Chain of Thought Reasoning

Chain of Thought reasoning helps models break down tasks into logical steps, improving reasoning performance. Historically, this was achieved through:

  • Few-Shot CoT: Providing reasoning examples for tasks (e.g., math problems).
  • Directives: Using prompts like “Let’s think step by step.”

These methods, while effective, can vary significantly across models. Recent reasoning models incorporate these steps automatically but lack transparency, making debugging challenging.

Introducing the Auto Reason Framework

The Auto Reason framework automates the creation of reasoning chains or traces for any given task. It’s a simple, two-prompt approach:

  1. Step 1: Generate reasoning steps for the query with a few-shot example included.
  2. Step 2: Use the reasoning traces along with the original query to generate the final output.

A key advantage is the ability to use a stronger model for reasoning step generation and a smaller, cost-effective model for the final output. This mimics OpenAI’s approach with its o1 models to optimize cost and latency.

Auto Reason in Practice

The framework has been tested on two datasets:

  • HotpotQA: Simple yes/no style questions where Auto Reason improved performance for smaller models like GPT-3.5 Turbo but had mixed results for more advanced models like GPT-4.
  • StrategyQA: Complex tasks requiring multiple reasoning steps (e.g., “Did Aristotle use a laptop?”). Here, Auto Reason significantly boosted performance for all tested models.

Key takeaway: Use frameworks like Auto Reason for complex tasks, but avoid overengineering reasoning for straightforward tasks where clean, concise instructions suffice.

Comparison with Auto Chain of Thought

The Auto Chain of Thought framework is another popular reasoning approach. It clusters similar questions and retrieves reasoning chains based on semantic similarity during inference time. While effective for generating diverse examples, it requires substantial setup, including:

  • Data segregation
  • Clustering
  • Sampling and retrieval

In contrast, Auto Reason offers simplicity, adaptability, and transparency, making it easier to implement and test for a variety of tasks.

Why Auto Reason?

Auto Reason stands out due to its ease of implementation and flexibility. It’s well-suited for teams looking to enhance reasoning capabilities without the overhead of complex frameworks. With tools and templates available for testing, Auto Reason provides a straightforward path to improving reasoning tasks in AI models.

Conclusion

Automating Chain of Thought reasoning is an effective way to boost performance, especially for challenging tasks. Frameworks like Auto Reason strike a balance between simplicity and effectiveness, allowing you to adapt prompts and models for your specific use case. Check out the linked resources and templates to get started!

The challenges with typical Chain of Thought prompting

Chain of Thought (CoT) prompting has long been one of the better prompt engineering methods for challenging, multi-step tasks.

Implementing Chain of Thought manually relied on creating task-specific reasoning chains and passing them as few-shot examples. Using a generic prompt like “Let’s think step by step” offers a simpler alternative, but it is often less effective at breaking complex problems down into subparts.
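To make the difference concrete, here is a hand-written illustration of the two styles in Python. The example problems and the reasoning chain are made up for demonstration and aren’t drawn from any benchmark.

```python
# Hand-written illustration of the two traditional CoT styles; the example
# problems and reasoning chain are made up for demonstration.

# Few-Shot CoT: you craft the reasoning chain yourself and pass it as an example.
few_shot_cot_prompt = """\
Q: A cafe sold 23 muffins in the morning and 18 in the afternoon. How many in total?
A: In the morning it sold 23 muffins. In the afternoon it sold 18 more.
23 + 18 = 41. The answer is 41.

Q: A library had 120 books and lent out 45. How many are left?
A:"""

# Zero-Shot CoT: a generic directive, no hand-crafted chain.
zero_shot_cot_prompt = """\
A library had 120 books and lent out 45. How many are left?

Let's think step by step."""
```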

Newer models like OpenAI’s o1 and Google Gemini 2.0 Flash Thinking Mode have shifted to automated reasoning, bypassing the need for manual CoT setups. While this has unlocked a variety of use cases that require deeper reasoning, the downside is a lack of visibility into the reasoning process. Without explicit reasoning steps, it’s harder to understand what’s happening under the hood or troubleshoot when things go wrong.

AutoReason steps in as a dynamic framework that not only automates reasoning but also retains interpretability, generating explicit reasoning traces tailored to each query.

AutoReason

How AutoReason Works

AutoReason is a two-step framework that generates reasoning chains and then uses those reasoning chains, along with the initial query, to generate better outputs.

  1. Rationale generation: A stronger model, such as GPT-4, creates detailed reasoning traces for a given task or query.
  2. Answer generation: A smaller, cost-effective model, like GPT-3.5-Turbo, uses these rationales to produce the final answer.

Flow diagram for AutoReason

The reasoning chains are generated dynamically based on the input query, which makes the framework adaptable and easy to use.

By separating the generation of reasoning steps from the final answer, AutoReason also provides a level of interpretability that models like OpenAI’s o1 currently lack.
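To make the two steps concrete, here is a minimal sketch using the OpenAI Python SDK. The rationale-generation prompt below paraphrases the idea rather than reproducing the exact few-shot prompt from the AutoReason paper, and the model names and helper functions are illustrative, not the authors’ implementation.

```python
# Minimal AutoReason-style sketch (assumes the OpenAI Python SDK: `pip install openai`).
# The rationale prompt paraphrases the idea; it is not the exact prompt from the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RATIONALE_PROMPT = (
    "You will decompose a question into explicit, numbered reasoning steps.\n"
    "Do not answer the question yet. Only list the intermediate facts and\n"
    "sub-questions someone would need to work through.\n\n"
    "Question: {query}\n"
    "Reasoning steps:"
)

ANSWER_PROMPT = (
    "Question: {query}\n\n"
    "Reasoning steps:\n{rationale}\n\n"
    "Using the reasoning steps above, give the final answer."
)

def generate_rationale(query: str, model: str = "gpt-4") -> str:
    """Step 1: a stronger model writes the reasoning trace for the query."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": RATIONALE_PROMPT.format(query=query)}],
    )
    return response.choices[0].message.content

def answer_with_rationale(query: str, rationale: str, model: str = "gpt-3.5-turbo") -> str:
    """Step 2: a cheaper model answers, conditioned on the generated trace."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": ANSWER_PROMPT.format(query=query, rationale=rationale)}],
    )
    return response.choices[0].message.content

query = "Did Aristotle use a laptop?"
rationale = generate_rationale(query)
print(answer_with_rationale(query, rationale))
```

Because the two prompts are independent, you can swap either model, for example running both steps on the same model, or pointing step one at whatever strong model you have access to.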

Why I like AutoReason

  • Simplicity: Unlike frameworks like Auto-CoT, AutoReason doesn’t require clustering or dataset retrieval, streamlining its implementation. It’s just two prompts.
  • Transparency: By generating explicit reasoning steps, you can see and troubleshoot the logic behind outputs.
  • Cost-effectiveness: If you’re trying to cut costs, you can use a stronger model for generating the reasoning chains and a weaker model for generating the final answer.

You can try out AutoReason via the template in PromptHub here.

AutoReason prompt template in PromptHub

Automatic Chain of Thought Prompt Enhancer

We recently launched prompt enhancers in PromptHub, including an option to generate chain of thought steps for any prompt. We took a lot of inspiration from AutoReason when building this out. Feel free to try it out for free in PromptHub - it's available on all plans!

A modal showing a task input and a generated reasoning chain

Experiment results

The researchers evaluated AutoReason on two benchmarks: StrategyQA and HotpotQA. HotpotQA, a simpler dataset, doesn’t require extensive task decomposition. For instance, a question like “Were Scott Derrickson and Ed Wood of the same nationality?” has a straightforward answer: “Yes.”

On the other hand, StrategyQA has more complex questions that demand implicit reasoning. For example, “What is the connection of James Clerk Maxwell to banknotes?” requires breaking down the problem into multiple logical steps.

With that out of the way, let’s check out some results.

HotpotQA: Fact-Based Tasks

A table of results for AutoReason on the Hotpot dataset

For HotpotQA, AutoReason showed mixed results:

  • GPT-3.5-Turbo: Baseline (61.6%), CoT (58.3%), AutoReason (76.6%)
  • GPT-4: Baseline (73.3%), CoT (63.3%), AutoReason (71.6%)

While AutoReason improved results for the weaker model (GPT-3.5-Turbo), performance degraded for the more advanced model (GPT-4). This is important for anyone writing prompts. Sometimes you can overdo it with prompt engineering when all you need is clear, crisp instructions.

StrategyQA: Excelling in Complex Reasoning

A table of results for AutoReason on the StrategyQA dataset

On the StrategyQA dataset, AutoReason outperformed both baseline and traditional CoT prompting:

  • GPT-3.5-Turbo: Baseline (55%), CoT (70.3%), AutoReason (76.6%)
  • GPT-4: Baseline (71.6%), CoT (76.6%), AutoReason (91.6%)

By dynamically generating reasoning traces, AutoReason was able to better solve more challenging questions.

The harder the challenge, the better AutoReason performed.

AutoReason vs. Auto-CoT

AutoReason isn’t the first framework for automatically generating reasoning chains and rationales; Analogical prompting, Chain of Verification (CoVe), and Auto-CoT all tackle the same problem.

AutoReason and Auto-CoT take different approaches to accomplishing similar goals.

Auto-CoT: A clustering-based approach

Auto-CoT focuses on creating diverse reasoning demonstrations by clustering questions from a dataset and selecting representative examples. Using the "Let’s think step by step" prompt, it generates reasoning chains for the examples inside the clusters. This keeps the examples diverse, which is a best practice for any sort of few-shot prompting.

  • Requires preprocessing: Dataset creation, clustering and sampling (retrieval) steps demand time and can be challenging to set up.
  • Best suited for static tasks: Works well for predefined datasets but is less adaptable to dynamic or query-specific tasks.

Auto CoT Flow Diagram
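For contrast with AutoReason’s two prompts, here is a rough sketch of what that preprocessing pipeline involves. It assumes the sentence-transformers and scikit-learn libraries; the question pool, embedding model, and cluster count are placeholders rather than the paper’s actual setup.

```python
# Rough sketch of Auto-CoT-style preprocessing (assumes `sentence-transformers`
# and `scikit-learn`; question pool, embedding model, and cluster count are placeholders).
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

questions = [
    "Were Scott Derrickson and Ed Wood of the same nationality?",
    "Did Aristotle use a laptop?",
    # ... the rest of your question pool
]

# 1. Embed and cluster the question pool.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(questions)
k = 2  # number of clusters, i.e. number of demonstrations to build
kmeans = KMeans(n_clusters=k, random_state=0, n_init=10).fit(embeddings)

# 2. Pick the question closest to each cluster center as its representative.
representatives = []
for cluster_id in range(k):
    idxs = np.where(kmeans.labels_ == cluster_id)[0]
    center = kmeans.cluster_centers_[cluster_id]
    closest = idxs[np.argmin(np.linalg.norm(embeddings[idxs] - center, axis=1))]
    representatives.append(questions[closest])

# 3. Generate a reasoning chain for each representative with Zero-Shot CoT
#    ("Let's think step by step"), then reuse those chains as few-shot
#    demonstrations at inference time.
demo_prompts = [f"Q: {q}\nA: Let's think step by step." for q in representatives]
```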

AutoReason: A simple, prompt-only framework

As mentioned above, AutoReason takes a different approach by dynamically generating reasoning traces for any query without the need for clustering or dataset creation. Its two-step process—leveraging a stronger model for reasoning generation and a weaker model for final answers—provides several advantages:

  • Easy to implement: No clustering or preprocessing steps required. Just two prompts.
  • Adaptability: Tailors reasoning traces to individual queries.
  • Transparency: Generates explicit reasoning steps, allowing teams to troubleshoot and understand model outputs more easily.

Conclusion

I love AutoReason because it is simple yet powerful. As a prompt-only framework, it’s easy to implement, highly adaptable, cost-efficient, and offers transparency into reasoning steps. Give it a try using the prompt templates available on PromptHub today!

Headshot of PromptHub Founder Dan Cleary
Dan Cleary
Founder