The Tree of Thoughts (ToT) prompting method arrived on the scene in May 2023, when a team of researchers from Princeton and Google DeepMind published “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.”
Tree of Thought prompting enables LLMs to make decisions by considering multiple reasoning paths and evaluating choices to decide the next course of action. It also enables the model to look ahead or backtrack when necessary.
New prompting methods are always exciting, but understanding why these methods outperform others can be difficult. Let’s take a high-level look at the experiments in this study, the results, and some insights on implementing this technique.
We also built this prompt template in PromptHub, so you can try it out today with just a click (linked later in this article).
What is Tree of Thought Prompting?
Tree of Thought (ToT) prompting is a prompt engineering method that aims to improve problem-solving by structuring the reasoning process like a decision tree. Unlike Chain of Thought prompting, which progresses linearly, Tree of Thought explores multiple reasoning paths in parallel. This allows LLMs to evaluate and compare different solutions to a problem, selecting the best reasoning path.
The Tree of Thoughts Prompting framework
Tree of Thought prompting builds on a concept popularized by Daniel Kahneman in his book Thinking, Fast and Slow. The core concept is that humans have a “dual process” when thinking, a fast, automatic, unconscious mode (“System 1”) and a slow, deliberate, conscious mode (“System 2”).
System 1: Automatically reaching for the light switch when entering a dark room
System 2: Solving a multi-step math problem
Tree of Thought prompting attempts to leverage System 2 thinking in the same way our brains do.
Tree of Thought prompting vs other prompt engineering methods
Think about a time when you were asked to complete a complex task, like a puzzle or brain game. You probably looked at the overall goal, chose the best route and started going that way. But maybe you realized that your initial route wasn’t going to work. So you had to go back a few steps.
Tree of Thought prompting looks to capitalize on this problem-solving approach, leveraging the ability to backtrack and explore different branches.
This type of thinking highlights two shortcomings of other prompting methods, like Chain of Thought:
- Locally, they do not explore different branches of the tree, merely following one route down (see diagram above)
- Globally, they don’t incorporate any kind of look-ahead or backtracking to help evaluate all the options at hand.
The details of Tree of Thought Prompting
Tree of Thought prompting tries to answer four questions:
- How to break down the task into smaller steps
- How to generate different ideas for each step
- How to evaluate if a particular step is good or not
- What type of search algorithm to use to navigate the tree and come to a final conclusion
1. Breaking down tasks into steps
Tree of Thought prompting first breaks the task down into smaller intermediate steps, or “thoughts,” sized so they can be generated and evaluated individually.
2. Generate ideas
Next, the Tree of Thoughts prompting method generates ideas for each step.
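In practice, this step can be as simple as asking the model for several candidate next moves from the current state. Below is a minimal sketch using the Game of 24 task from the experiments later in this article, assuming a hypothetical `llm(prompt)` helper that wraps whichever model API you use; the exact prompts in the paper are different.

```python
def llm(prompt: str) -> str:
    """Hypothetical stand-in: send the prompt to your model of choice and return its reply."""
    raise NotImplementedError


def propose_thoughts(state: str, k: int = 5) -> list[str]:
    """Generate up to k candidate next steps ("thoughts") from the current state."""
    prompt = (
        f"Current numbers: {state}\n"
        "Propose possible next steps: combine two of the numbers with +, -, *, or /, "
        "and list the numbers that remain. Give each proposal on its own line."
    )
    completion = llm(prompt)
    # Keep the first k non-empty lines as candidate thoughts. A sampling-based
    # variant would instead call the model k times at a higher temperature.
    return [line.strip() for line in completion.splitlines() if line.strip()][:k]
```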
3. Evaluate ideas
Next, the model evaluates the ideas for each step using two main methods.
- Independent Evaluation: The LLM assesses each idea independently, assigning a value or classification (sure/likely/impossible); a minimal sketch follows this list.
- Voting Across Ideas: When direct evaluation is challenging (like in writing tasks), the model compares solutions and votes for the most promising one.
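Here is a minimal sketch of the independent-evaluation variant, continuing the Game of 24 example and reusing the hypothetical `llm` helper from the previous sketch. The sure/likely/impossible labels are mapped to numeric scores so candidate thoughts can be ranked; the score values are an assumption for illustration.

```python
SCORES = {"sure": 1.0, "likely": 0.5, "impossible": 0.0}


def evaluate_thought(thought: str) -> float:
    """Ask the model whether a partial solution can still reach 24, and score its verdict."""
    prompt = (
        f"Numbers remaining after this step: {thought}\n"
        "Can these numbers still reach 24 using +, -, *, and /? "
        "Answer with exactly one word: sure, likely, or impossible."
    )
    verdict = llm(prompt).strip().lower()
    return SCORES.get(verdict, 0.0)  # treat unrecognized answers as impossible
```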
4. Navigate the entire problem-space
Tree of Thought prompting next shifts focus to navigating and evaluating the whole problem-space.
It leverages search algorithms to achieve this. Two common search algorithms used in Tree of Thought prompting are listed below (a breadth-first sketch follows the list):
- Breadth-First Search (BFS): Explores states level by level, suitable for problems with limited depth and a small initial set of states.
- Depth-First Search (DFS): Focuses on promising states and backtracks if necessary, ideal for complex problems with a larger search space.
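To make the search step concrete, here is a breadth-first sketch that ties the previous helpers together. It is an illustrative skeleton rather than the paper's implementation: `k` controls how many thoughts are generated per state, and `b` is the breadth limit on how many states survive each level.

```python
def tot_bfs(initial_numbers: str, steps: int = 3, k: int = 5, b: int = 5) -> list[str]:
    """Breadth-first Tree of Thought search: expand, score, and keep the b best states per level."""
    frontier = [initial_numbers]
    for _ in range(steps):
        # Expand every state in the frontier into up to k candidate thoughts.
        candidates = [t for state in frontier for t in propose_thoughts(state, k)]
        # Score each candidate and keep only the b most promising for the next level.
        frontier = sorted(candidates, key=evaluate_thought, reverse=True)[:b]
    return frontier  # the surviving candidates after the final step
```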
Tree of Thought Prompting example
Usually Tree of Thought prompting is broken down into multiple prompts, but we combined them all into a single Tree of Thought template that you can try out in PromptHub here.
If you don't have PromptHub access but want to try it out, reply to the email that gets sent when you join the waitlist and I'll share an access code with you.
Tree of Thought Prompting experiments
There were three experiments run in this paper, spanning different types of tasks. GPT-4 was used across all experiments, and three prompting techniques were put to the test: Input-Output (IO), Chain of Thought (CoT), and Tree of Thought (ToT).
Experiment 1: Game of 24
Game of 24 is a game where four numbers and basic arithmetic operations (+, -, *, /) must be combined to reach 24.
For each task, an output was counted as a success if it was a valid equation that equaled 24 and used each of the input numbers exactly once.
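For example, given the input "4 9 10 13", the output "(13 - 9) * (10 - 4) = 24" counts as a success: it equals 24 and uses each of the four inputs exactly once.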
Results
You’ll see two variables, k and b, in the table below.
- k (thought generation): the number of candidate thoughts generated at each step.
- b (search breadth): the breadth limit, or the number of explored states kept at each level.
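These correspond directly to the k and b parameters in the breadth-first sketch earlier in this article.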
Taking a look at the table below, IO (basic prompting) and both versions of CoT prompting perform poorly. Tree of Thought prompting with a breadth of just 1 already achieved a much higher success rate of 45%, and that rate rocketed to 74% when the breadth was increased to 5.
CoT's success rate also increased when given more chains to judge from, but not to the level of Tree of Thought prompting.
Experiment 2: Creative Writing
Next up was creative writing. The model was given four randomly generated sentences and prompted to output a coherent passage of four paragraphs, each ending with one of the four given sentences.
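For open-ended writing tasks like this, the evaluation step relies on voting across candidates rather than independent scoring: the model drafts several candidate plans or passages and is then asked which one is most promising. Here is a minimal voting sketch, reusing the hypothetical `llm` helper from earlier; the prompt wording and vote count are assumptions for illustration.

```python
import re
from collections import Counter


def vote_best(candidates: list[str], votes: int = 5) -> str:
    """Ask the model several times which candidate is most coherent, and tally the votes."""
    numbered = "\n\n".join(f"Choice {i + 1}:\n{c}" for i, c in enumerate(candidates))
    prompt = (
        f"{numbered}\n\n"
        "Which choice is the most coherent passage? "
        "Answer with 'The best choice is N', where N is the choice number."
    )
    tally = Counter()
    for _ in range(votes):
        match = re.search(r"best choice is (\d+)", llm(prompt), re.IGNORECASE)
        if match:
            tally[int(match.group(1)) - 1] += 1
    # Fall back to the first candidate if no vote could be parsed.
    best_index = tally.most_common(1)[0][0] if tally else 0
    return candidates[best_index]
```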
Setup
100 test inputs were generated, each consisting of four random sentences.
The coherency of the output was judged in two ways:
- Using a GPT-4 zero-shot prompt to provide a 1-10 score
- Judged by human authors in a blind study.
Results
The graph on the left shows the average scores given by GPT-4.
- Tree of Thought prompting: 7.56
- CoT: 6.93
- IO: 6.19
The author scores are on the right and further confirm the initial findings. The authors preferred Tree of Thought prompting over CoT 41 out of 100 times, while they only preferred CoT over Tree of Thought prompting 21 times (the other 38 pairs were a draw).
Experiment 3: Mini-crosswords
Mini-crosswords is a much more challenging problem that requires Tree of Thought prompting to produce more branches and deeper search. Let's see how it holds up.
Setup
For each task, the input consists of 5 horizontal and 5 vertical clues, and the expected output is a solved 5x5 crossword board.
A depth-first search algorithm was used to explore promising word clues and backtrack when needed.
If a remaining clue is deemed 'impossible' to fill, that subtree is pruned and DFS backtracks to the parent state to explore the next promising thought.
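Below is a minimal sketch of that depth-first search with pruning. The `propose_words`, `is_impossible`, and `fill` helpers are hypothetical stand-ins for the paper's actual prompts and board representation.

```python
def propose_words(board, clue):
    """Hypothetical: ask the model for candidate words that fit this clue on the current board."""
    raise NotImplementedError


def is_impossible(board):
    """Hypothetical: ask the model whether any remaining clue on this board is impossible to fill."""
    raise NotImplementedError


def fill(board, clue, word):
    """Hypothetical: return a new board with the word placed in the clue's slot."""
    raise NotImplementedError


def tot_dfs(board, clues, index=0):
    """Depth-first Tree of Thought search over crossword clues, pruning impossible boards."""
    if index == len(clues):
        return board  # every clue is filled: a complete candidate solution
    for word in propose_words(board, clues[index]):
        new_board = fill(board, clues[index], word)
        if is_impossible(new_board):
            continue  # prune this subtree: the partial board can no longer be completed
        solution = tot_dfs(new_board, clues, index + 1)
        if solution is not None:
            return solution
    return None  # no candidate worked: backtrack to the parent state
```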
Results
As you can see below, the IO and CoT methods perform poorly, with word-level success rates below 16%. Tree of Thought prompting significantly improves all metrics, achieving a 60% word-level success rate and solving 4 out of 20 games. This improvement is largely because IO and CoT lack mechanisms to try different clues, make changes, or backtrack.
Wrapping up
While there are tons of prompting methods out there, I haven’t found one that rules them all. This highlights the importance of testing and comparing outputs from different methods.
If you want to test out the Tree of Thoughts method, or other advanced methods like Multi-Persona Prompting, feel free to use our free templates.
Happy prompting!