You’ve probably heard of test-time compute before. That’s what happens when a reasoning model prints its chain of thought. In other words, the model is doing extra compute (reasoning) after you send the prompt. But what if you could have the model reason ahead of time?

Enter sleep-time compute, a new concept from researchers at Letta.

With sleep-time compute, you prompt the model to pre-process context (documents, a codebase, conversation history, etc.) ahead of time, essentially creating a dense summary of the information you think users are most likely to ask about. It’s an interesting concept that sounds like prompt caching and sounds like RAG, but is different from both.

So, does this advance planning actually save tokens and latency without hurting accuracy? That’s what the rest of this article will unpack. Let’s dive in and see.

What’s up everyone, you’ve probably heard of test-time compute before, but what if LLMs could think and reason while you sleep? That’s what the latest paper from the team at Letta is proposing. Letta started as a research paper and is now a full-fledged company. Their new paper, *Sleep-time Compute: Beyond Inference Scaling at Test-time*, introduces a way for models to “think” offline before a user query even arrives, effectively reducing cost and latency at inference.

Here’s the problem: when using a reasoning model, test-time compute can be slow and expensive. The model has to reason over long context in real time, sometimes taking several minutes for a single task. Sleep-time compute addresses this by letting the model process context *ahead of time*, during idle or “sleep” time.

How it works

Imagine an app that processes a codebase or reviews financial docs. During sleep time, you send the model all the relevant context (e.g., files, conversation history, financial statements). Then, you use a **precompute prompt** to extract information the user is likely to ask about (e.g., last quarter’s revenue). This produces a **dense summary** of important info, based on likely user needs.

This dense summary is then prepended to the prompt at test time. That way, instead of reasoning over all the raw context again and again, the model just references the precomputed summary. This is **not** prompt caching (the context is rewritten, not cached verbatim), and it’s **not** RAG (there is no semantic retrieval).

When should you use it?

  • The context is long-lived (e.g., persistent codebases)
  • Multiple queries hit the same context
  • You want to reduce cost or latency
  • User queries are predictable

Performance

Across multiple datasets and model types, sleep-time compute consistently outperformed standard approaches, achieving the same or better accuracy with fewer tokens. For example, in one experiment with Claude 3.5 Sonnet, the model using sleep-time compute hit the same accuracy with 11k tokens that the baseline needed 20k tokens for, and the baseline never fully caught up.

Prompt & Tool Design

Here’s a simplified example of the prompt pattern used:

**Prompt**:

> You are an offline memory agent. Your task is to reorganize and consolidate memories by calling the `rethink_memory` tool at every step...

The `rethink_memory` tool is used to integrate new facts into memory. The model uses it iteratively until complete, then signals it’s done with a separate tool call (`finish_rethinking_memory`).

Case Study: SWE Repair

In a simulated developer agent scenario:

  • The agent is given a repository and a 7-file PR with an issue.
  • With baseline prompting, it sees the whole context at once.
  • With sleep-time compute, it gets the same context, but also a condensed summary generated beforehand.
  • This approach used **3,000 fewer tokens on average**, while producing better repair plans.

Final Thoughts

Sleep-time compute is essentially a clever use of prompting and tool-use to “pre-digest” large context. It’s not about shoving everything into a long context window; it’s about using idle time wisely to anticipate needs, prep context, and reduce redundancy.

Amazing work from Letta. The paper will be linked below and is definitely worth a read if you’re building with LLMs or exploring efficient architecture patterns. See you in the next one!

What is Sleep-Time Compute?

Sleep-time compute is a technique that lets a language model pre-process its context during idle time so it can answer future questions with fewer tokens.

Instead of waiting for a user question, sending it to the model along with the relevant context, and burning reasoning tokens in real time, you prompt the model ahead of time, during its idle (sleep) periods, to go through the standing context (docs, codebase, etc.) and generate a tight, inference-rich summary it can pull from later. It works particularly well if you have a general sense of the information your users are going to ask about.

Two side by side graphics comparing sleep time compute and test time compute

This isn’t prompt caching because the context itself is getting rewritten rather than cached verbatim. It’s also not RAG, because there is no vector database or retrieval going on.

The payoff is that, when a user sends a question, the model can answer faster and with higher accuracy by pulling from the condensed summary rather than having to spend a lot of time reasoning over the full context.
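To make that concrete, here is a minimal sketch of the test-time side, assuming the OpenAI Python SDK; the model name, summary file path, and question are placeholders, not anything prescribed by the paper:

```python
from openai import OpenAI

client = OpenAI()

# Produced earlier, during idle time, by the sleep-time pass over the full standing context.
condensed_summary = open("standing_context_summary.txt").read()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        # Prepend the dense summary instead of the raw docs / codebase / chat history.
        {"role": "system", "content": "Answer using this condensed context:\n\n" + condensed_summary},
        {"role": "user", "content": "What changed in last quarter's revenue?"},
    ],
)
print(response.choices[0].message.content)
```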

Below is the prompt used to generate the dense summaries used for sleep-time compute. Access it directly in PromptHub here.

Sleep time compute prompt template in PromptHub

How Sleep-Time Compute Works

The sleep-time compute prompt has access to two tools.

rethink_memory(new_memory, target_block, source_block)

Merge new facts from source_block into the running summary stored in target_block.

Rethink_memory tool description
rethink_memory tool
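If you were reproducing this with a standard function-calling API, the tool definition might look roughly like the sketch below. This uses the OpenAI-style tool schema as an example format, and the descriptions paraphrase the behavior described in this post; it is not copied from Letta's implementation:

```python
rethink_memory_tool = {
    "type": "function",
    "function": {
        "name": "rethink_memory",
        "description": (
            "Integrate new facts from a source memory block into the running "
            "summary held in a target memory block, rewriting it in place."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "new_memory": {
                    "type": "string",
                    "description": "The rewritten summary after merging in the new facts.",
                },
                "target_block": {
                    "type": "string",
                    "description": "Label of the block that stores the running summary.",
                },
                "source_block": {
                    "type": "string",
                    "description": "Label of the block whose facts are being merged.",
                },
            },
            "required": ["new_memory", "target_block", "source_block"],
        },
    },
}
```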

How rethink_memory works

  1. Pick a source block
    The model selects one memory block (e.g., persona, human, or a chunk of context) whose information hasn’t yet been merged into the running summary.
  2. Integrate & rewrite
    It generates a new_memory string by combining the old summary with any new facts, inferences, or corrections from that source block. Redundant lines are removed and outdated statements are updated to reflect the most likely truth.
  3. Write back
    The updated string is written into the target block, usually the “rethink_memory block” that holds the growing, condensed summary.
  4. Repeat or stop
    The model repeats steps 1–3 as many times as needed. When no further improvements can be made, it calls finish_rethinking_memory() to end the loop.

finish_rethinking_memory()

Ends the loop when nothing useful remains to integrate.

Finish_rethink_memory tool description
finish_rethinking tool
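A matching sketch of its definition in the same function-calling format; it takes no arguments:

```python
finish_rethinking_memory_tool = {
    "type": "function",
    "function": {
        "name": "finish_rethinking_memory",
        "description": (
            "Signal that the condensed summary is complete and nothing useful "
            "remains to integrate."
        ),
        "parameters": {"type": "object", "properties": {}},
    },
}
```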

End-to-end flow

  1. Idle period – Model receives the sleep-time prompt and iteratively calls rethink_memory() until the summary is dense, consistent, and inference-rich.
  2. Loop ends – Model calls finish_rethinking_memory().
  3. Test time – System prepends the condensed summary to the user’s question; the model now needs only a short chain-of-thought to answer, slashing live tokens and latency.

By shifting heavy merge-and-reason work to idle moments, sleep-time compute typically cuts live token budgets by about 5 × while often boosting accuracy.
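Here is a rough end-to-end sketch of that flow using the OpenAI Python SDK's tool-calling interface. The model name, prompts, file path, and step cap are placeholders, and this illustrates the pattern rather than Letta's actual implementation:

```python
import json

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder model name

# Compact versions of the two tool schemas sketched above.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "rethink_memory",
            "description": "Merge new facts from source_block into the summary in target_block.",
            "parameters": {
                "type": "object",
                "properties": {
                    "new_memory": {"type": "string"},
                    "target_block": {"type": "string"},
                    "source_block": {"type": "string"},
                },
                "required": ["new_memory", "target_block", "source_block"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "finish_rethinking_memory",
            "description": "Signal that the summary is complete.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
]

memory_blocks = {
    "raw_context": open("context.txt").read(),  # docs, codebase notes, chat history, ...
    "rethink_memory_block": "",                 # the growing condensed summary
}

def sleep_time_pass(max_steps: int = 10) -> str:
    """Idle-time loop: the model calls rethink_memory until it calls finish_rethinking_memory."""
    messages = [
        {
            "role": "system",
            "content": (
                "You are an offline memory agent. Consolidate the provided memory blocks into "
                "rethink_memory_block by calling rethink_memory at every step; call "
                "finish_rethinking_memory when nothing useful remains to integrate."
            ),
        },
        {"role": "user", "content": json.dumps(memory_blocks)},
    ]
    for _ in range(max_steps):
        response = client.chat.completions.create(model=MODEL, messages=messages, tools=TOOLS)
        msg = response.choices[0].message
        if not msg.tool_calls:
            break
        messages.append(msg)
        for call in msg.tool_calls:
            if call.function.name == "finish_rethinking_memory":
                return memory_blocks["rethink_memory_block"]
            args = json.loads(call.function.arguments)
            memory_blocks[args["target_block"]] = args["new_memory"]  # write back (step 3 above)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": "ok"})
    return memory_blocks["rethink_memory_block"]

def answer(question: str, summary: str) -> str:
    """Test time: prepend the condensed summary instead of the full raw context."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Condensed context:\n" + summary},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

summary = sleep_time_pass()                         # steps 1-2: idle period, loop ends
print(answer("What does the PR change?", summary))  # step 3: test time
```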

Experiment Setup

Before we look at the numbers, here’s how the researchers set up the experiments.

They built stateful versions of standard benchmarks. Stateful here means each problem is split into a persistent context and a separate question, so the same context can support multiple follow-up queries (see image below). Then they picked a variety of popular models and ran each one in two modes:

  1. Test-time compute
  2. Sleep-time compute

a graphic showing a query and context split

  1. Datasets
    • Stateful GSM-Symbolic (P1 & P2) – GSM8K problems split into context + query
    • Stateful AIME (’24 & ’25) – AIME problems split into context + query
    • Multi-Query GSM-Symbolic – each context is paired with up to ten synthetic follow-up questions, so one sleep-time summary can be reused.
    • SWE-Features – pull-request fixes from SWE-bench to test an agentic coding workflow.
  2. Models
    • A mix of reasoning and non-reasoning models; the results in this post reference Claude 3.5 Sonnet, Claude 3.7 Sonnet, and o1.
  3. Inference modes
    • Baseline: test-time compute only. For non-reasoning models, specific prompts were used to elicit more or less reasoning.
    • Sleep-time compute: context rewritten during idle time, tiny token budget at answer-time.
  4. Metrics & sweeps
    • Accuracy (math) or F1 (SWE-Features) vs. avg. test-time tokens/question.
    • Extra sweeps: parallel summaries k = 1, 2, 5, 10; predictability bins; 1→10 questions per context.
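For intuition, a stateful benchmark item can be thought of as a persistent context paired with one or more questions that reference it, plus a slot for the precomputed summary. A rough sketch of that shape in Python (field names are illustrative, not the benchmarks' actual format):

```python
from dataclasses import dataclass, field

@dataclass
class StatefulExample:
    """A persistent context plus the questions that reference it."""
    context: str                                        # shared setup, reused across queries
    questions: list[str] = field(default_factory=list)  # e.g. up to ten in Multi-Query GSM-Symbolic
    sleep_time_summary: str = ""                        # filled in once during the idle-time pass
```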

Experiment Results

Alright now let’s jump into the results, starting with Stateful GSM-Symbolic (P1 & P2), a math dataset.

1. Math benchmarks:

two graphs for GSM-Symbolic dataset

  • On both P1 and P2 splits, sleep-time compute (blue) reaches the same accuracy with roughly 5 × fewer test-time tokens than the plain baseline (grey).
  • Everything inside the blue wedge represents “free” gains, either higher accuracy at the same cost or the same accuracy at a lower cost.

4 graphs for multiple models

  • Across four models, the roughly 5× saving holds.
  • Only o1 surpasses the sleep-time compute curve, and only when it’s allowed to spend 8k–10k tokens per question.

2. Scaling sleep-time compute itself

The researchers ran the sleep-time prompt k times in parallel, producing k independent summaries, and then prepended all of them before the user’s query at test time. On the plot, each curve’s color (from light gray to deep blue) indicates a different k value.

Two graphs side by side for scaling sleep time compute

At the same answer-time token budget, increasing k from 2→5 boosts accuracy by up to +13 pp on GSM-P1 and +18 pp on GSM-P2. Beyond k = 5, returns diminish.
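A rough sketch of that scaling knob, assuming the OpenAI SDK and an illustrative summarization prompt (the paper's actual sleep-time pass is the tool-calling loop shown earlier):

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder model name

def sleep_time_summary(context: str) -> str:
    """One independent sleep-time pass over the same standing context."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Condense this context into a dense, inference-rich summary."},
            {"role": "user", "content": context},
        ],
    )
    return response.choices[0].message.content

def scaled_sleep_time(context: str, k: int = 5) -> str:
    """Run k independent sleep-time passes in parallel; all k summaries get prepended at test time."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        summaries = list(pool.map(sleep_time_summary, [context] * k))
    return "\n\n---\n\n".join(summaries)
```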

Key takeaways

  • Parallel passes increase performance.
  • k ≈ 5 is the sweet spot; returns diminish beyond that.
  • No extra test-time cost: all of the gains come from the additional prepended summaries, and test-time tokens stayed fixed.

3. Predictability and amortization

The more predictable the user questions are, the better sleep-time compute performs. This is because you can tune the sleep-time compute prompt to specifically address the most predictable and frequent questions.

predictability versus accuracy for test time compute

  • The gap widens with predictability: in the top predictability bin, sleep-time compute beats the baseline by ≈ 0.55 accuracy on GSM-P1.

4. Agentic coding benchmark

graph for swe-bench agentic coding benchmark

  • On SWE-Features (more on this later), Claude 3.7 Sonnet with sleep-time compute outperforms standard test-time processing until the thinking budget crosses ~8,500 tokens.

Key takeaways

  • 5 × cheaper answers on both math benchmarks at the same accuracy.
  • +13–18 pp accuracy when allowing for multiple runs of the sleep-time compute prompt.
  • Bigger wins when questions are predictable or when several queries share the same context.
  • Agent workflows benefit up until ~8,500 reasoning tokens
  • Overall, sleep-time compute increases accuracy and decreases costs almost everywhere you have reusable context.

When should you use sleep-time compute?

Use it when the context remains steady and speed or cost per query really matter.

  • Long-lived context: Documents, large codebases, chat histories persist across sessions.
  • Multiple expected queries: The same background will be referenced more than once, so you can amortize the prep work.
  • Tight latency or cost targets: You need fast responses and want to keep token spend low.
  • Predictable queries: You have some sense of the types of questions users will ask, which allows you to tune your sleep-time compute prompt and produce better summaries.

You can skip sleep-time compute for:

  • One-off questions - you’ll never ask about that context again.
  • Wildly open-ended asks - advance summaries are unlikely to contain the right information.
  • Ultra-hard, bespoke tasks - the model needs a full, long chain-of-thought tied exactly to the single question.
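As a rough rule of thumb in code, the two checklists above boil down to something like the following helper; the logic is my own distillation of the criteria, not anything from the paper:

```python
def should_use_sleep_time_compute(
    context_is_long_lived: bool,
    expected_queries_per_context: int,
    queries_are_predictable: bool,
    latency_or_cost_sensitive: bool,
) -> bool:
    """Mirror the checklists above: reuse, predictability, and cost pressure favor sleep-time compute."""
    if expected_queries_per_context <= 1 and not latency_or_cost_sensitive:
        return False  # one-off question: just pay the test-time cost
    return context_is_long_lived and (
        expected_queries_per_context > 1
        or queries_are_predictable
        or latency_or_cost_sensitive
    )
```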

SWE-Features Case Study

Setup

SWE-Features is the SWE-bench slice focused on feature-addition pull requests.

For the example task the agent gets:

  • the entire aider repo (Python)
  • a seven-file PR that introduces a new -issue flag
  • a six-step repair prompt telling it to explore, plan, patch, test, and submit

Baseline vs. Sleep-Time Setup

Baseline Run

  1. At answer-time the model sees only the raw context:
    • Uploaded files & working directory
    • PR description
    • Six-step repair plan
  2. It explores, plans, and patches in one go, burning ≈ 10 k live tokens.

Sleep-Time Compute Run

  1. During idle time, the model sees the raw context plus an “exploration” prompt and builds a one-page summary.
  2. At answer-time, the model sees:
    • The same raw context (uploaded files, working dir, PR description, 6-step plan)
    • The condensed summary prepended on top
  3. It follows the identical six-step plan but only needs ≈ 3 k live tokens to answer.
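In terms of what actually reaches the model, the only difference between the two runs is how the answer-time prompt is assembled. A sketch of that assembly step; the variable names and placeholder strings are mine, not the paper's:

```python
def build_messages(repo_context: str, pr_description: str, repair_plan: str,
                   summary: str | None = None) -> list[dict]:
    """Assemble the answer-time prompt for the coding agent."""
    parts = []
    if summary is not None:
        # Sleep-time run: the condensed summary is prepended on top of the raw context.
        parts.append("Condensed repo summary (precomputed during idle time):\n" + summary)
    parts += [repo_context, pr_description, repair_plan]
    return [
        {"role": "system", "content": "You are a coding agent. Follow the repair plan."},
        {"role": "user", "content": "\n\n".join(parts)},
    ]

# Baseline run: raw context only.
baseline_messages = build_messages("<repo files>", "<PR description>", "<six-step plan>")

# Sleep-time run: same raw context plus the one-page exploration summary.
sleep_time_messages = build_messages("<repo files>", "<PR description>", "<six-step plan>",
                                     summary="<condensed exploration summary>")
```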

A graph with two intersecting lines

  • With sleep-time compute, Claude 3.7 Sonnet reaches the same F1 score that the baseline only achieves after a 10 k-token chain-of-thought, with a 3× reduction in live-token spend and latency.
  • The baseline only pulls ahead when it’s allowed to use over 10k reasoning tokens per patch.

Why It Matters

Code-repair agents spend most of their budget exploring the repo, not editing files. Moving that exploration into sleep-time lets the model ship fixes faster and cheaper whenever follow-up questions hit the same repository snapshot.

Financial-Analyst chatbot example

Let’s look at another example of how a financial-analyst chatbot can use sleep-time compute to digest large financial documents ahead of time and answer market-day questions in sub-second time.

1. Standing context

Every night the bot downloads Acme Corp’s latest financial documents (10-Ks, 10-Qs, web searches, etc.).

2. Sleep-time pass

Ahead of time, we will have the model extract key items from the financial documents that we think our users are most likely to ask questions about.

Sleep-time compute prompt example:

You are OfflineMemory-Finance.
From the documents below extract:
• Fiscal-year revenue, gross margin, operating income, free cash flow.
• Year-over-year deltas for each metric.
• All new debt covenants or risk factors.
• Altman-Z-Score and quick ratio.
Call rethink_memory(<facts>) as many times as needed (max 10);
call finish_rethinking_memory() when no new information can be added.

3. Stored summary example

The output from the prompt above may be something like this:

  • FY-2024 revenue $18.6 B (+8 % YoY)
  • Gross margin 42 % (-1 pp YoY)
  • Free cash flow $2.1 B (+12 % YoY)
  • Net debt $4.2 B; new covenant: max Net-Debt/EBITDA < 3.0×
  • Altman-Z-Score 3.1 (“safe” zone)
  • Key risk added: chip-supply constraints could raise COGS.

4. Outcome

Parsing tables, footnotes, and ratio math happens once, overnight. During market hours, each analyst query can be answered in sub-second time by starting from the condensed summary instead of reprocessing the full 10-K.
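A sketch of that nightly-plus-market-hours split, again assuming the OpenAI SDK; the file path, model name, and trimmed-down prompt are placeholders, and a real sleep-time pass would use the tool loop shown earlier:

```python
from pathlib import Path

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder model name
SUMMARY_PATH = Path("acme_summary.txt")  # placeholder location for the stored summary

FINANCE_SLEEP_PROMPT = (
    "You are OfflineMemory-Finance. From the documents below, extract fiscal-year revenue, "
    "gross margin, operating income, free cash flow, year-over-year deltas, new debt covenants "
    "or risk factors, the Altman-Z-Score, and the quick ratio."
)

def nightly_sleep_time_pass(filings: str) -> None:
    """Runs once per night: condense the filings into a reusable summary."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": FINANCE_SLEEP_PROMPT},
            {"role": "user", "content": filings},
        ],
    )
    SUMMARY_PATH.write_text(response.choices[0].message.content)

def market_hours_answer(question: str) -> str:
    """Runs per analyst query: answer from the stored summary, not the raw 10-K."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Condensed financials:\n" + SUMMARY_PATH.read_text()},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```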

Why this matters

SEC filings update only once a year, but analysts fire off dozens of follow-ups daily. Sleep-time compute turns a heavy 10-K scrape into a reusable knowledge capsule, delivering sub-second replies and slashing inference costs, all without a separate vector DB or retrieval layer.

Conclusion

Sleep-time compute is another way to increase the performance of LLM applications. By distilling your context into a reusable summary, it amortizes compute across multiple queries and delivers faster, more consistent results.

On the flip side, summaries can go stale if the underlying context changes. One-off or unpredictable questions may fall outside the pre-computed summary, and at very high live-token budgets a clean, focused chain-of-thought can sometimes edge ahead.

Overall, for any workflow where the same context is queried repeatedly under tight cost or latency constraints, sleep-time compute offers a high-leverage trade-off, but you'll still need to monitor freshness and have fallbacks for unexpected asks.

Dan Cleary
Founder