The Tree of Thoughts (ToT) prompting method made its way onto the scene in May 2023, when a team of Princeton and DeepMind researchers published “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”.
ToT enables LLMs to make decisions by considering multiple paths and evaluating choices to decide the next course of action. Additionally, ToT enables the LLM to look ahead or backtrack when necessary.
New prompting methods are always exciting, but understanding why these methods outperform others can be difficult. Let’s take a high-level look at the experiments in this study, the results, and some insights on implementing this technique.
We also built this prompt template in PromptHub, so you can try it out today with just a click (linked later in this article).
The Tree of Thoughts framework
ToT builds on a concept popularized by Daniel Kahneman in his book Thinking, Fast and Slow. The core idea is that human thinking follows a “dual process”: a fast, automatic, unconscious mode (“System 1”) and a slow, deliberate, conscious mode (“System 2”).
System 1: Automatically reaching for the light switch when entering a dark room
System 2: Solving a multi-step math problem
ToT attempts to leverage System 2 in the same way our brains do.
ToT vs other prompting methods
Think about a time when you were asked to complete a complex task, like a puzzle or brain game. You probably looked at the overall goal, chose the best route and started going that way. But maybe you realized that your initial route wasn’t going to work. So you had to go back a few steps.
ToT looks to capitalize on this problem-solving approach, leveraging the ability to backtrack and explore different branches.
This type of thinking highlights two shortcomings of other prompting methods, like Chain of Thought (CoT).
- Locally, they do not explore different branches of the tree, merely following one route down (see diagram above)
- Globally, they don’t incorporate any form of lookahead or backtracking to help evaluate all the different options at hand.
The details of Tree of Thoughts
ToT tries to answer four questions:
- How to break down the task into smaller steps
- How to generate different ideas for each step
- How to evaluate if a particular step is good or not
- What type of search algorithm to use to navigate the tree and come to a final conclusion
1. Breaking down tasks into steps
First, ToT breaks the task down into smaller, intermediate steps.
2. Generate ideas
Next, the ToT method generates ideas for each step.
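The generation step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `call_llm` is a hypothetical helper standing in for whatever LLM client you use, and the propose prompt is a simplified stand-in for the paper's task-specific prompts.

```python
# A minimal sketch of the idea-generation step. `call_llm(prompt)` is a
# hypothetical helper that returns the model's raw text completion.
PROPOSE_PROMPT = (
    "You are solving a problem step by step.\n"
    "Current partial solution:\n{state}\n"
    "Propose {k} distinct next steps, one per line."
)

def generate_ideas(call_llm, state: str, k: int = 3) -> list[str]:
    """Ask the model for up to k candidate next steps from the current state."""
    response = call_llm(PROPOSE_PROMPT.format(state=state, k=k))
    # One idea per line; drop blank lines and surrounding whitespace.
    return [line.strip() for line in response.splitlines() if line.strip()][:k]
```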
3. Evaluate ideas
Next, the model evaluates the ideas for each step using two main methods.
- Independent Evaluation: The LLM assesses each idea independently, assigning a value or classification (sure/likely/impossible).
- Voting Across Ideas: When direct evaluation is challenging (like in writing tasks), the model compares solutions and votes for the most promising one.
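The independent-evaluation variant can be sketched as follows. Again this is an illustrative assumption, not the paper's code: `call_llm` is a hypothetical helper, and the numeric scores attached to each label are arbitrary choices made here for ranking purposes.

```python
# A sketch of independent evaluation. `call_llm(prompt)` is a hypothetical
# helper expected to answer with one of: "sure", "likely", "impossible".
VALUE_PROMPT = (
    "Evaluate whether this partial solution can still reach the goal.\n"
    "Partial solution:\n{state}\n"
    "Answer with exactly one word: sure, likely, or impossible."
)

# Map the model's verbal judgment onto a numeric score for ranking
# (the specific values are an arbitrary choice for this sketch).
LABEL_SCORES = {"sure": 2.0, "likely": 1.0, "impossible": 0.0}

def evaluate_state(call_llm, state: str) -> float:
    label = call_llm(VALUE_PROMPT.format(state=state)).strip().lower()
    return LABEL_SCORES.get(label, 0.0)  # treat unrecognized answers as impossible
```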
4. Navigate the entire problem-space
The ToT method now shifts focus to navigating and evaluating the whole problem-space.
It leverages search algorithms to achieve this. Two common search algorithms used in ToT are:
- Breadth-First Search (BFS): Explores states level by level, suitable for problems with limited depth and a small initial set of states.
- Depth-First Search (DFS): Focuses on promising states and backtracks if necessary, ideal for complex problems with a larger search space.
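Putting the pieces together, the BFS variant can be sketched as a generic search loop. This is a toy skeleton under stated assumptions: `generate` and `evaluate` stand in for the LLM-backed generation and evaluation steps described above, and the parameter names `k` and `b` mirror the variables used in the paper's Game of 24 results.

```python
import heapq

def tot_bfs(generate, evaluate, initial_state, steps: int, k: int, b: int):
    """Breadth-first Tree of Thoughts search (illustrative sketch).

    generate(state, k) -> list of up to k candidate next states
    evaluate(state)    -> numeric score, higher is more promising
    At each level, every kept state is expanded, and only the b
    highest-scoring candidates survive (the breadth limit).
    """
    frontier = [initial_state]
    for _ in range(steps):
        candidates = [s for state in frontier for s in generate(state, k)]
        # Keep only the b most promising states for the next level.
        frontier = heapq.nlargest(b, candidates, key=evaluate)
    return max(frontier, key=evaluate) if frontier else initial_state
```

With LLM-backed `generate` and `evaluate` functions plugged in, the same loop covers all three experiments below; only the prompts change.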
Usually ToT is broken down into multiple prompts, but we combined them all into a single template that you can try out in PromptHub (link here).
If you don't have PromptHub access but want to try it out, reply to the email that gets sent when you join the waitlist and I'll share an access code with you.
There were three experiments run in this paper, spanning different types of tasks. GPT-4 was used across all experiments, and three prompting techniques were put to the test: basic Input-Output (IO), CoT, and ToT.
Experiment 1: Game of 24
Game of 24 is a game where four numbers and the basic arithmetic operations (+, -, *, /) are used to reach 24.
For each task, an output was defined as a success if it was a valid equation that equaled 24, and used all the inputs exactly once.
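That success check is mechanical, so it can be expressed directly in code. Here is a small self-contained checker written for this article (not taken from the paper): the expression must use each input number exactly once and evaluate to 24.

```python
import re

def is_valid_24(expression: str, inputs: list[int]) -> bool:
    """Check a Game of 24 answer: every input number must be used
    exactly once and the expression must evaluate to 24."""
    numbers = [int(n) for n in re.findall(r"\d+", expression)]
    if sorted(numbers) != sorted(inputs):
        return False
    # Allow only digits, arithmetic operators, parentheses, and spaces,
    # so the eval below cannot run arbitrary code.
    if not re.fullmatch(r"[\d+\-*/() ]+", expression):
        return False
    try:
        return abs(eval(expression) - 24) < 1e-9
    except (SyntaxError, ZeroDivisionError):
        return False
```

For example, `is_valid_24("(10 - 4) * (13 - 9)", [4, 9, 10, 13])` passes, while an expression that skips an input or reaches a different total fails.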
You’ll see two variables, k and b, in the table below.
- k = (Thought Generation): Number of potential ideas generated at each step.
- b = (Search Algorithm): Breadth limit of explored states at each level.
Taking a look at the table below, IO (basic prompting) and both versions of CoT prompting perform poorly. ToT, with a breadth limit of b = 1, already achieved a much higher success rate. That success rate rocketed to 74% when the breadth was increased to 5.
CoT's success rate also increased when given more chains to judge from, but not to the level of ToT.
Experiment 2: Creative Writing
Next up was creative writing. The model was given four randomly generated sentences and prompted to output a coherent passage of four paragraphs, each ending with one of the four sentences.
100 sets of input sentences were generated.
The coherency of the outputs was judged in two ways:
- Using a GPT-4 zero-shot prompt to provide a 1-10 score
- Judged by human authors in a blind study.
The graph on the left shows the average scores given by GPT-4.
- ToT: 7.56
- CoT: 6.93
- IO: 6.19
The author scores are on the right and further confirm the initial findings. The authors preferred ToT over CoT 41 out of 100 times, while they only preferred CoT over ToT 21 times (the other 38 pairs were a draw).
Experiment 3: Mini Crosswords
Mini Crosswords is a much more challenging problem that requires ToT to produce more branches and steps. Let's see how it holds up.
For each task, the input includes 5 horizontal and 5 vertical clues, and the expected output is a 5x5 board that solves the crossword.
A depth-first search algorithm was used to explore promising word clues and backtrack when needed.
If a clue is deemed 'impossible' to fill, the subtree exploration is pruned, and DFS backtracks to the parent for the next promising thought.
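That prune-and-backtrack behavior is the essence of the DFS variant, and can be sketched generically. As before, this is an illustrative skeleton, not the paper's implementation: `generate`, `evaluate`, and `is_goal` are hypothetical stand-ins for the LLM-backed proposal, the 'impossible' judgment (score 0 here), and the solved-board check.

```python
def tot_dfs(generate, evaluate, state, depth: int, is_goal):
    """Depth-first Tree of Thoughts search with pruning (sketch).

    Candidates are explored most-promising-first; any subtree whose
    root is judged impossible (score 0) is pruned, and the search
    backtracks to try the next candidate.
    """
    if is_goal(state):
        return state
    if depth == 0:
        return None
    # Try the most promising candidate states first.
    for nxt in sorted(generate(state), key=evaluate, reverse=True):
        if evaluate(nxt) == 0:      # deemed impossible: prune this subtree
            continue
        found = tot_dfs(generate, evaluate, nxt, depth - 1, is_goal)
        if found is not None:
            return found
    return None                      # dead end: backtrack to the parent
```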
As you can see below, IO and CoT methods perform poorly with a word-level success rate below 16%. ToT significantly improves all metrics, achieving a 60% word-level success rate and solving 4 out of 20 games. This improvement is largely due to the fact that IO and CoT lack mechanisms to try different clues, make changes, or backtrack.
While there are tons of prompting methods out there, I haven’t found one that rules them all. This highlights the importance of testing and comparing outputs from different methods.
If you want to test out the Tree of Thoughts method, or other advanced methods like Multi-Persona Prompting, feel free to use our free templates.