You’ve probably heard of test-time compute before. That’s what happens when a reasoning model prints its chain of thought. In other words, the model is doing extra compute (reasoning) after you send the prompt. But what if you could have the model reason ahead of time?
Enter sleep-time compute, a new concept from researchers at Letta.
With sleep-time compute, you prompt the model to pre-process context (documents, a codebase, conversation history, etc.) ahead of time, essentially creating a dense summary of the information you think users are most likely to ask about. It’s an interesting concept that sounds a bit like prompt caching and a bit like RAG, but it’s different from both.
So, does this advance planning actually save tokens and latency without hurting accuracy? That’s what the rest of this article will unpack. Let’s dive in and see.
What is Sleep-Time Compute?
Sleep-time compute is a technique that lets a language model pre-process its context during idle time so it can answer future questions with fewer tokens.
Instead of waiting for a user question, sending it to the model with the relevant context, and burning reasoning tokens on the spot, you prompt the model ahead of time, during its idle (sleep) periods, to go through the standing context (docs, codebase, etc.) and generate a tight, inference-rich summary it can pull from later. It works particularly well if you have a general sense of the information your users are going to ask about.

This isn’t prompt caching because the context itself is getting rewritten rather than cached verbatim. It’s also not RAG, because there is no vector database or retrieval going on.
The payoff is that, when a user sends a question, the model can answer faster and with higher accuracy by pulling from the condensed summary rather than having to spend a lot of time reasoning over the full context.
Below is the prompt used to generate the dense summaries for sleep-time compute. Access it directly in PromptHub here.

How Sleep-Time Compute Works
The sleep-time compute prompt has access to two tools.
rethink_memory(new_memory, target_block, source_block)
Merges new facts from source_block into the running summary stored in target_block.

How rethink_memory works
- Pick a source block: The model selects one memory block (e.g., persona, human, or a chunk of context) whose information hasn’t yet been merged into the running summary.
- Integrate & rewrite: It generates a new_memory string by combining the old summary with any new facts, inferences, or corrections from that source block. Redundant lines are removed and outdated statements are updated to reflect the most likely truth.
- Write back: The updated string is written into the target block, usually the “rethink_memory block” that holds the growing, condensed summary.
- Repeat or stop: The model repeats steps 1–3 as many times as needed. When no further improvements can be made, it calls finish_rethinking_memory() to end the loop.

finish_rethinking_memory()
Ends the loop when nothing useful remains to integrate.
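Putting the two tools together, here’s a minimal sketch of the idle-time loop. It assumes a generic LLM client with a hypothetical propose_tool_call() helper (not Letta’s actual API); only the tool names come from the article.

```python
# Minimal sketch of the idle-time loop. `llm.propose_tool_call()` is a hypothetical
# helper, not Letta's actual API; only the tool names come from the article.

memory_blocks = {
    "persona": "...",                # standing context blocks
    "human": "...",
    "rethink_memory_block": "",      # running condensed summary
}

def rethink_memory(new_memory: str, target_block: str, source_block: str) -> None:
    # Overwrite the target block with the rewritten summary. source_block is
    # informational here; the model has already folded its facts into new_memory.
    memory_blocks[target_block] = new_memory

def run_sleep_time_pass(llm, sleep_time_prompt: str, max_steps: int = 20) -> str:
    # Keep integrating until the model signals it is done (or we hit the cap).
    for _ in range(max_steps):
        call = llm.propose_tool_call(sleep_time_prompt, memory_blocks)
        if call.name == "finish_rethinking_memory":
            break
        rethink_memory(**call.arguments)
    return memory_blocks["rethink_memory_block"]
```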

End-to-end flow
- Idle period – The model receives the sleep-time prompt and iteratively calls rethink_memory() until the summary is dense, consistent, and inference-rich.
- Loop ends – The model calls finish_rethinking_memory().
- Test time – The system prepends the condensed summary to the user’s question; the model now needs only a short chain-of-thought to answer, slashing live tokens and latency.
By shifting heavy merge-and-reason work to idle moments, sleep-time compute typically cuts live token budgets by about 5 × while often boosting accuracy.
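For the test-time side, a rough sketch of how the condensed summary gets prepended looks like this. client.chat() is a placeholder for whatever inference API you use, and the budget number is illustrative.

```python
# Sketch of the answer-time step: prepend the condensed summary and use a small
# live budget. `client.chat()` is a placeholder for your inference API.

def answer(client, condensed_summary: str, question: str, max_tokens: int = 512) -> str:
    messages = [
        {"role": "system", "content": "Answer using the context summary below."},
        {"role": "system", "content": f"Context summary:\n{condensed_summary}"},
        {"role": "user", "content": question},
    ]
    # The heavy merge-and-reason work already happened offline, so a short
    # chain-of-thought is usually enough here.
    return client.chat(messages, max_tokens=max_tokens)
```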
Experiment Setup
Before we look at the numbers, here’s how the researchers set up the experiments.
They built stateful versions of standard benchmarks. Stateful here means each problem is split into a persistent context and a separate question, so the same context can support multiple follow-up queries (see image below). Then they picked a variety of popular models and ran each one in two modes:
- Test-time compute
- Sleep-time compute

- Datasets
  - Stateful GSM-Symbolic (P1 & P2) – GSM8K problems split into context + query
  - Stateful AIME (’24 & ’25) – AIME problems split into context + query
  - Multi-Query GSM-Symbolic – each context is paired with up to ten synthetic follow-up questions, so one sleep-time summary can be reused.
  - SWE-Features – pull-request fixes from SWE-bench to test an agentic coding workflow.
- Models
- Inference modes
  - Baseline: test-time compute only. For non-reasoning models, specific prompts were used to elicit more or less reasoning.
  - Sleep-time compute: context rewritten during idle time, tiny token budget at answer-time.
- Metrics & sweeps
  - Accuracy (math) or F1 (SWE-Features) vs. avg. test-time tokens/question.
  - Extra sweeps: parallel summaries k = 1, 2, 5, 10; predictability bins; 1→10 questions per context.
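If you want to reproduce this kind of accuracy-vs-token plot yourself, the sweep is conceptually simple. The sketch below assumes a solve(context, question, max_tokens) callable that returns a prediction plus the test-time tokens it consumed; it is not the paper’s actual harness.

```python
# Rough sketch of an accuracy-vs-test-time-token sweep (illustrative only).
# `solve(context, question, max_tokens)` is assumed to return (prediction, tokens_used).

def sweep(solve, dataset, token_budgets=(1_000, 2_000, 4_000, 8_000)):
    curve = []
    for budget in token_budgets:
        correct, tokens_used = 0, 0
        for context, question, gold in dataset:
            prediction, used = solve(context, question, max_tokens=budget)
            correct += int(prediction == gold)
            tokens_used += used
        # One point per budget: (avg test-time tokens per question, accuracy)
        curve.append((tokens_used / len(dataset), correct / len(dataset)))
    return curve
```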
Experiment Results
Alright, now let’s jump into the results, starting with Stateful GSM-Symbolic (P1 & P2), a math dataset.
1. Math benchmarks:

- On both P1 and P2 splits, sleep-time compute (blue) reaches the same accuracy with roughly 5 × fewer test-time tokens than the plain baseline (grey).
- Everything inside the blue wedge represents “free” gains, either higher accuracy at the same cost or the same accuracy at a lower cost.

- Across four models, the roughly 5 × saving holds.
- Only o1 surpasses the sleep-time compute curve, and only when you let it spend 8k–10k tokens per question.
2. Scaling sleep-time compute itself
The researchers ran the sleep-time prompt k times in parallel, producing k independent summaries, and then prepended all of them before the user’s query at test time. On the plot, each curve’s color (from light gray to deep blue) indicates a different k value.

At the same answer-time token budget, increasing k from 2→5 boosts accuracy by up to +13 pp on GSM-P1 and +18 pp on GSM-P2. Beyond k = 5, returns diminish.
Key takeaways
- Parallel passes increase performance.
- k ≈ 5 is the sweet spot; returns diminish after that.
- Zero extra inference cost: all performance gains come from the additional prepended summaries, and test-time tokens stay fixed.
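As a rough sketch, the k-parallel variant just runs the idle-time pass k times and concatenates the results before the query; run_sleep_time_pass() and answer() are the hypothetical helpers from the earlier sketches.

```python
# Sketch of the k-parallel variant; run_sleep_time_pass() and answer() are the
# hypothetical helpers from the earlier sketches.

def build_parallel_summaries(llm, sleep_time_prompt: str, k: int = 5) -> list[str]:
    # Each pass is sampled independently so the k summaries capture different inferences.
    return [run_sleep_time_pass(llm, sleep_time_prompt) for _ in range(k)]

def answer_with_k_summaries(client, summaries: list[str], question: str) -> str:
    # All k summaries are prepended; the live (test-time) token budget stays the same.
    combined = "\n\n---\n\n".join(summaries)
    return answer(client, combined, question)
```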
3. Predictability and amortization
The more predictable the user questions are, the better sleep-time compute performs. This is because you can tune the sleep-time compute prompt to specifically address the most predictable and frequent questions.

- The gap widens with predictability: in the top predictability bin, sleep-time compute beats the baseline by ≈ 0.55 accuracy on GSM-P1.
4. Agentic coding benchmark

- On SWE-Features (more on this later), Claude 3.7 Sonnet with sleep-time compute outperforms normal processing until the thinking budget crosses roughly 8,500 tokens.
Key takeaways
- 5 × cheaper answers on both math benchmarks at the same accuracy.
- +13–18 pp accuracy when allowing for multiple runs of the sleep-time compute prompt.
- Bigger wins when questions are predictable or when several queries share the same context.
- Agent workflows benefit until the thinking budget reaches ~8,500 reasoning tokens.
- Overall, sleep-time compute increases accuracy and decreases costs almost everywhere you have reusable context.
When should you use sleep-time compute?
Use it when the context remains steady and speed or cost per query really matter.
- Long-lived context: Documents, large codebases, chat histories persist across sessions.
- Multiple expected queries: The same background will be referenced more than once, so you can amortize the prep work.
- Tight latency or cost targets: You need fast responses and want to keep token spend low.
- Predictable queries: You have some sense of the types of questions users will ask, which allows you to tune your sleep-time compute prompt and produce better summaries.
You can skip sleep-time compute for:
- One-off questions - you’ll never ask about that context again.
- Wildly open-ended asks - advance summaries are likely to not have the right context.
- Ultra-hard, bespoke tasks - the model needs a full, long chain-of-thought tied exactly to the single question.
SWE-Features Case Study
Setup
SWE-Features is the SWE-bench slice focused on feature-addition pull requests.
For the example task the agent gets:
- the entire aider repo (Python)
- a seven-file PR that introduces a new -issue flag
- a six-step repair prompt telling it to explore, plan, patch, test, and submit
Baseline vs. Sleep-Time Setup
Baseline Run
- At answer-time the model sees only the raw context:
- Uploaded files & working directory
- PR description
- Six-step repair plan
- It explores, plans, and patches in one go, burning ≈ 10 k live tokens.
Sleep-Time Compute Run
- During idle time, the model sees the raw context plus an “exploration” prompt and builds a one-page summary.
- At answer-time, the model sees:
- The same raw context (uploaded files, working dir, PR description, 6-step plan)
- The condensed summary prepended on top
- It follows the identical six-step plan but only needs ≈ 3 k live tokens to answer.
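In pseudo-Python, the two runs differ only in where the exploration happens. Everything below (run_agent(), the placeholder context strings, the exact budgets) is illustrative, not the actual SWE-Features harness.

```python
# Illustrative comparison of the two runs; run_agent(), the placeholder context
# strings, and the exact budgets are assumptions, not the actual harness.

def run_agent(messages: list[str], thinking_budget: int) -> str:
    # Stand-in for a coding-agent call; a real agent would return a patch/diff.
    return "<patch>"

raw_context = [
    "<uploaded aider repo files + working directory>",
    "<PR description for the new -issue flag>",
    "<six-step repair plan: explore, plan, patch, test, submit>",
]

# Baseline: explore, plan, and patch all at answer time (~10k live tokens).
baseline_patch = run_agent(raw_context, thinking_budget=10_000)

# Sleep-time: the exploration pass runs offline (its budget doesn't hit live latency)...
repo_summary = run_agent(raw_context + ["<exploration prompt>"], thinking_budget=10_000)
# ...and its one-page summary is prepended at answer time, so ~3k live tokens suffice.
sleep_time_patch = run_agent([repo_summary] + raw_context, thinking_budget=3_000)
```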

- With sleep-time compute, Claude 3.7 Sonnet reaches the same F1 score that the baseline only achieves after a 10 k-token chain-of-thought, with a 3× reduction in live-token spend and latency.
- Baseline only pulls ahead when it's allowed to use over 10k reasoning tokens per patch.
Why It Matters
Code-repair agents spend most of their budget exploring the repo, not editing files. Moving that exploration into sleep-time lets the model ship fixes faster and cheaper whenever follow-up questions hit the same repository snapshot.
Financial-Analyst chatbot example
Let’s look at another example of how a financial-analyst chatbot can use sleep-time compute to digest large financial documents ahead of time and answer market-day questions in milliseconds.
1. Standing context
Every night the bot downloads Acme Corp’s latest financial documents (10-Ks, 10-Qs, web searches, etc.).
2. Sleep-time pass
Ahead of time, we will have the model extract key items from the financial documents that we think our users are most likely to ask questions about.
Sleep-time compute prompt example:
You are OfflineMemory-Finance.
From the documents below extract:
• Fiscal-year revenue, gross margin, operating income, free cash flow.
• Year-over-year deltas for each metric.
• All new debt covenants or risk factors.
• Altman-Z-Score and quick ratio.
Call rethink_memory(<facts>) as many times as needed (max 10);
finish_rethinking() when no new information can be added.
3. Stored summary example
The output from the prompt above may be something like this:
- FY-2024 revenue $18.6 B (+8 % YoY)
- Gross margin 42 % (-1 pp YoY)
- Free cash flow $2.1 B (+12 % YoY)
- Net debt $4.2 B; new covenant: max Net-Debt/EBITDA < 3.0×
- Altman-Z-Score 3.1 (“safe” zone)
- Key risk added: chip-supply constraints could raise COGS.
4. Outcome
Parsing tables, footnotes, and ratio math happens once, overnight. During market hours, each analyst query can be answered in sub-second time by starting from the condensed summary instead of reprocessing the full 10-K.
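One way this nightly-refresh pattern might be wired up is sketched below; the cache layout, the prompt constant, and the reuse of the earlier hypothetical helpers are all assumptions for illustration.

```python
import datetime

# Illustrative nightly-refresh pattern; reuses the hypothetical run_sleep_time_pass()
# and answer() helpers from earlier sketches. Cache layout and prompt text are assumptions.

SLEEP_TIME_FINANCE_PROMPT = "You are OfflineMemory-Finance. From the documents below extract: ..."
summary_cache = {"acme": {"summary": "", "as_of": None}}

def nightly_sleep_time_pass(llm, filings_text: str) -> None:
    # Heavy work (tables, footnotes, ratio math) happens here, once per night.
    condensed = run_sleep_time_pass(llm, SLEEP_TIME_FINANCE_PROMPT + "\n\n" + filings_text)
    summary_cache["acme"] = {"summary": condensed, "as_of": datetime.date.today()}

def answer_market_hours_question(client, question: str) -> str:
    # Market-hours path: start from the condensed capsule instead of re-parsing the 10-K.
    return answer(client, summary_cache["acme"]["summary"], question)
```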
Why this matters
SEC filings update only once a year, but analysts fire off dozens of follow-ups daily. Sleep-time compute turns a heavy 10-K scrape into a reusable knowledge capsule, delivering sub-second replies and slashing inference costs, all without a separate vector DB or retrieval layer.
Conclusion
Sleep-time compute is another way to increase the performance of LLM applications. By distilling your context into a reusable summary, it amortizes compute across multiple queries and delivers faster, more consistent results.
On the flip side, summaries can go stale if the underlying context changes. One-off or unpredictable questions may fall outside the pre-computed summary, and at very high live-token budgets a clean, focused chain-of-thought can sometimes edge ahead.
Overall, for any workflow where the same context is queried repeatedly under tight cost or latency constraints, sleep-time compute offers a high-leverage trade-off, but you'll still need to monitor freshness and have fallbacks for unexpected asks.