Reducing Latency with Skeleton of Thought Prompting

Most of the research around Large Language Models (LLMs) and prompt engineering focuses on improving the quality of answers. We’ve covered several prompt engineering methods (multi-persona prompting, tree-of-thoughts prompting, and “according to” prompting) that achieve this with varying degrees of success.

That is why this paper from Microsoft and Tsinghua University caught my eye (you can check it out here). It introduces a new prompting method called Skeleton of Thought (SoT).

The SoT method is different from other methods because it is built not just to get better outputs, but to make the LLM work faster and more efficiently. Before we dive into this method, let's talk about a few challenges LLMs face.

Figure: Prompt flow for the Skeleton of Thought framework (the Skeleton-of-Thought process)

Performance challenges with LLMs

LLMs are extremely powerful, but they have their faults when it comes to latency and efficiency.

Latency issues: When a model generates an output, it produces one token at a time (more on tokens here). This sequential decoding phase is the most time-consuming part of generation.
Resource under-utilization: LLMs run on Graphics Processing Units (GPUs), which are designed to handle many operations in parallel. Because tokens are generated one after another, much of that capacity often sits idle.
Complexity and efficiency trade-off: The larger you make the model, the more it will “know,” but the more computation it needs for every token it generates.

How Skeleton of Thought Prompting works

The Skeleton of Thought framework looks to reduce latency by enabling parallel processing.

At the heart of SoT is the idea of segmenting output generation. Instead of generating a response in a straight line, SoT divides the content into distinct segments.

These segments are processed simultaneously, allowing for multiple parts of an answer to be crafted at once. It's like writing several sentences of a paragraph in parallel, rather than sequentially (this has drawbacks, which we'll touch on later).
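
To make the intuition concrete, here is a rough back-of-the-envelope sketch of why parallel segments help. It assumes decoding time grows linearly with the number of tokens and that parallel segments don't slow each other down. Neither is guaranteed in practice, and every number below is made up for illustration (none of them come from the paper).

```python
def sequential_latency(total_tokens: int, seconds_per_token: float) -> float:
    """Latency when every token is decoded one after another."""
    return total_tokens * seconds_per_token


def sot_latency(
    skeleton_tokens: int, point_tokens: list[int], seconds_per_token: float
) -> float:
    """Idealized SoT latency: the skeleton first, then all points in parallel.

    Assumption: parallel expansions do not slow each other down, so the
    expansion stage takes as long as its longest point.
    """
    return (skeleton_tokens + max(point_tokens)) * seconds_per_token


# Hypothetical numbers, chosen only for illustration.
seconds_per_token = 0.03
point_tokens = [60, 80, 70, 50]  # four segments of the answer

normal = sequential_latency(sum(point_tokens), seconds_per_token)
sot = sot_latency(30, point_tokens, seconds_per_token)
speed_up = normal / sot
print(f"normal={normal:.2f}s  sot={sot:.2f}s  speed-up={speed_up:.2f}x")
```

With these made-up numbers, the normal answer decodes 260 tokens in sequence, while the idealized SoT run only waits for 110 (a 30-token skeleton plus the longest 80-token segment), which works out to roughly a 2.4x speed-up. The sketch ignores the extra prompt tokens each expansion request has to process, so real gains will be smaller.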

Anatomy of the SoT Framework: Understanding the Prompts

SoT uses two prompts to guide the LLM to generate an output efficiently.

1. Skeleton prompt
The process begins with an initial prompt that instructs the model to produce a structured skeleton of the intended answer, similar to a set of bullet points or an outline.

💬 Prompt: You’re an organizer responsible for only giving the skeleton (not the full content) for answering the question. Provide the skeleton in a list of points (numbered 1., 2., 3., etc.) to answer the question. Instead of writing a full sentence, each skeleton point should be very short with only 3∼5 words. Generally, the skeleton should have 3∼10 points.

Question: What are the typical types of Chinese dishes?
Skeleton:

1. Dumplings.
2. Noodles.
3. Dim Sum.
4. Hot Pot.
5. Wonton.
6. Ma Po Tofu.
7. Char Siu.
8. Fried Rice.

Question: What are some practical tips for individuals to reduce their carbon emissions?
Skeleton:

1. Energy conservation.
2. Efficient transportation.
3. Home energy efficiency.
4. Reduce water consumption.
5. Sustainable diet.
6. Sustainable travel.

Now, please provide the skeleton for the following question.
{{question}}
Skeleton:
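
Here's a minimal sketch of how you might wire up the skeleton stage in Python. The function names (build_skeleton_prompt, parse_skeleton) are my own, the model reply is a hard-coded stand-in rather than a real API response, and the parser assumes the model follows the numbered format requested in the prompt.

```python
import re


def build_skeleton_prompt(template: str, question: str) -> str:
    """Fill the {{question}} placeholder in the skeleton prompt template."""
    return template.replace("{{question}}", question)


def parse_skeleton(text: str) -> list[str]:
    """Pull the points out of a numbered skeleton such as '1. Dumplings.'."""
    points = []
    for line in text.splitlines():
        match = re.match(r"\s*\d+\.\s*(.+?)\s*$", line)
        if match:  # skip blank lines and anything that is not a numbered point
            points.append(match.group(1))
    return points


# Only the tail of the template is shown here; the full text is the prompt above.
template_tail = (
    "Now, please provide the skeleton for the following question.\n"
    "{{question}}\n"
    "Skeleton:"
)
prompt = build_skeleton_prompt(template_tail, "How can I reduce my carbon emissions?")

# A hypothetical model reply, used here instead of a real API call.
reply = "1. Energy conservation.\n2. Efficient transportation.\n3. Sustainable diet."
points = parse_skeleton(reply)
print(points)
```

Each entry in points then becomes the input to one expansion request in the next stage.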

2. Point-expanding prompt
Next, the LLM is prompted to expand on each point from the list. These expansions run in parallel, which is where the latency gains come from. For API-based models like OpenAI’s, this means sending one request per item in the list and issuing those requests concurrently.
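
The parallel expansion step could be sketched like this. Treat it as an illustration under a few assumptions: call_llm is a placeholder for whatever function sends a prompt to your model, fake_llm is a stub so the example runs offline, and the expansion prompt wording is my own rather than the one used in the paper.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def expand_points(
    question: str,
    points: list[str],
    call_llm: Callable[[str], str],
    max_workers: int = 8,
) -> list[str]:
    """Expand every skeleton point with its own request, issued concurrently."""

    def expand(numbered_point: tuple[int, str]) -> str:
        number, point = numbered_point
        # Illustrative wording only; this is not the prompt from the paper.
        prompt = (
            f"Question: {question}\n"
            f"Expand only point {number} of the skeleton: {point}\n"
            "Write 1-2 sentences and do not cover the other points."
        )
        return call_llm(prompt)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, even if calls finish out of order.
        return list(pool.map(expand, enumerate(points, start=1)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real API call so the sketch runs offline."""
    return f"(expansion of: {prompt.splitlines()[1]})"


points = ["Energy conservation.", "Efficient transportation."]
expansions = expand_points("How can I reduce my carbon emissions?", points, fake_llm)
answer = "\n".join(f"{p} {e}" for p, e in zip(points, expansions))
print(answer)
```

Threads are a reasonable fit here because each request mostly waits on the network. Note that no expansion sees the text of any other, which is exactly what makes the calls independent and also what causes the coherence issues discussed later.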

We put together a template so you can try out this method easily in PromptHub (link here).

If you don't have PromptHub access but want to try it out, reply to the email that gets sent when you join the waitlist and I'll share an access code with you.

Figure: The Skeleton of Thought template in the PromptHub platform

Experiments: Setup

The researchers put SoT to the test with a few experiments. The goal was to investigate how SoT reduces the end-to-end latency across different models and question types.

These experiments consisted of a wide range of tasks from code generation to complex, multi-faceted writing.

Datasets: Vicuna-80 dataset, which consists of 80 questions spanning nine categories.

Models: 11 models, 9 open-source and 2 API-based models.

Benchmarks: SoT was compared to normal prompting, where the model generates the full answer sequentially.

The results

Speed-up breakdown: Models

The first experiment was designed to see how SoT reduced latency on different models.

Figure: Two bar graphs depicting the speed-up effect of SoT prompting

What jumps out is that SoT obtains a >2x speed-up in 6 out of 11 models.

Speed-up breakdown: Question categories

Next, the researchers broke down the speed-up gains by question category.

Figure: Bar graph of speed-up results broken down by category (red categories = poor output quality)

Latency Breakdown: SoT stages

The graph below presents the absolute latencies of normal and SoT-generated responses.

Figure: Breakdown of SoT's latency gains into stages, by model and by category

The decoding (token generation) phase accounts for the majority of the end-to-end latency.

Overall Quality

How does SoT compare to normal generation when it comes to output quality? To find out, the researchers used two LLM-based evaluation frameworks: FastChat and LLMZoo.

For each question, the normal answer and the SoT answer are presented to an LLM judge (ChatGPT-3.5 in this case), which is asked for its preference.

Figure: Win, tie, and lose percentages of SoT compared to normal prompting

As we can see, SoT performs better than or equal to normal prompting ~80% of the time.
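
If you want to run a similar comparison on your own prompts, the tallying step is simple. Here is a small sketch, assuming you have already collected one verdict per question from your judge. The verdicts below are made up and are not the paper's data.

```python
from collections import Counter


def preference_rates(verdicts: list[str]) -> dict[str, float]:
    """Turn per-question judge verdicts ('win', 'tie', 'lose') into rates."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return {label: counts[label] / len(verdicts) for label in ("win", "tie", "lose")}


# Made-up verdicts for ten questions, from SoT's point of view.
verdicts = ["win"] * 4 + ["tie"] * 3 + ["lose"] * 3
rates = preference_rates(verdicts)
not_worse = rates["win"] + rates["tie"]  # "better than or equal to" normal prompting
print(rates, round(not_worse, 2))
```

The "better than or equal to" figure is just the win rate plus the tie rate.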

Quality Breakdown: Question Categories

Let’s see how SoT performs across different question categories.

Figure: Two bar graphs comparing how the SoT method performs on specific types of questions (SoT does poorly on tasks that require coherent, step-by-step thinking)

SoT performs relatively well on generic, common-sense, knowledge, roleplay, and counterfactual questions. It performs relatively badly on writing, Fermi, math, and coding questions.

Let’s take math as an example. Math questions require step-by-step thinking. Without knowing the previous steps, it is going to be really hard to figure out the next step. A method like Tree of Thoughts would perform better here. In contrast, SoT requires the model to come up with the skeleton of the solution from the start and then figure out each individual step independently without referring to previous results.

The categories that SoT performed well on (counterfactual, knowledge, common-sense, generic) share one characteristic: the ideal answer covers several relatively independent points.

SoT performs well when the question can be answered in several points whose details can be expanded independently. If the question requires step-by-step thinking, SoT will perform poorly.
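
One practical way to act on this is to only use SoT where it held up. The sketch below is my own illustration, not something from the paper, and it assumes you already know each question's category. The category names are the ones from the quality breakdown above.

```python
# Categories where SoT performed relatively well in the breakdown above.
SOT_FRIENDLY = {"generic", "common-sense", "knowledge", "roleplay", "counterfactual"}


def choose_strategy(category: str) -> str:
    """Return 'sot' for categories where SoT held up, otherwise 'normal'."""
    return "sot" if category.strip().lower() in SOT_FRIENDLY else "normal"


for category in ["knowledge", "math", "Roleplay", "coding"]:
    print(category, "->", choose_strategy(category))
```

Anything not on the list, including categories the experiments didn't cover, falls back to normal prompting, which is the safer default when coherence matters.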

Quality Breakdown: Metrics

Last up, the researchers looked at which aspects of answer quality SoT improves and which it hurts.

Figure: Win, tie, and lose rates for SoT across different quality metrics

As we can see, SoT improves the diversity and relevance, while hurting the immersion and coherence of outputs.

Coherence: SoT underperforms because it breaks the answer into points that are expanded independently, so no point can build on the others.
Immersion: SoT has a hard time maintaining a consistent role given the way the framework breaks the answer down into a skeleton.
Diversity: The skeleton stage in SoT encourages LLMs to think from multiple perspectives.
Relevance: In the skeleton stage the model is forced to propose several points related to the specific (relevant) question. In the point-expanding stage, LLMs are required to only discuss these points.

Wrapping up

The Skeleton of Thought framework is mainly focused on reducing latency rather than increasing quality. This fresh take is interesting but has drawbacks. Combining this approach with other prompting methods could offer the best of both worlds, but parallel processing of chunks will always have coherence issues.

If you’re interested in other prompt engineering methods I would recommend checking out our other articles, or trying the prompts directly in PromptHub.

Happy prompting!

Dan Cleary

Founder
