Speculative Prefilling
If you know what a user is gonna do next, you can prefill your KV cache with candidate contexts and significantly improve your time-to-first-token.
The Interaction
I was revisiting an interaction idea from a while back: if we have a great model of user context, then we can effectively build tab-to-autocomplete that works everywhere. I called this system Tabracadabra 🎉. Here’s the demo video:
Pretty cool, huh?
The Problem
Now what I kinda hid from you here is that this little clip is sped up a TON. I’m talking 10x. That spinner spins for quite some time before you see anything. Here’s the non sped-up video:
That stretch while the spinner is running is known as the time to first token (or TTFT for short). You might also notice that once the first token appears, the actual decoding speed is not so bad!
What exactly is going on while the spinner is running? This is the prefill phase: the model processes all input tokens in parallel, building the KV cache. Unlike the decode phase, we have every input token up front, so we can compute the full attention matrix in one shot. Awesome!
In theory, this should be a LOT faster than decoding, right? We can parallelize the prefill phase, but not the decode phase! So what’s going on?
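To make the phase structure concrete, here’s a toy sketch (the `attend` and `sample` hooks are placeholders, not a real inference API): prefill touches every input token in one batched pass, while decode has to loop one token at a time.

```python
# Toy sketch of the two phases. `attend` and `sample` are placeholder
# hooks, not a real inference API.

def prefill(tokens, attend):
    """One parallel pass over ALL input tokens, building the KV cache."""
    kv_cache = list(tokens)      # stand-in for cached keys/values
    attend(kv_cache, kv_cache)   # full n x n attention, computed in one shot
    return kv_cache

def decode(kv_cache, attend, sample, steps):
    """Inherently sequential: each new token depends on the previous one."""
    out = []
    for _ in range(steps):
        tok = sample(attend(kv_cache[-1:], kv_cache))
        kv_cache.append(tok)     # cache grows by one token per step
        out.append(tok)
    return out
```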
This is because, in the Tabracadabra setting, the context is an order of magnitude larger than what’s generated. For that demo example, I’m retrieving all of my past emails, screenshots of me in similar settings, etc. etc. We end up in a setting where the context can be 20x-30x larger than the text that’s actually generated.
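A quick back-of-envelope calculation shows why the spinner dominates even when prefill throughput is much higher than decode throughput. All of the numbers below are made-up assumptions for illustration, not measurements:

```python
# Back-of-envelope sketch of why prefill dominates latency here.
# All numbers are illustrative assumptions, not measurements.

def ttft_seconds(context_tokens: int, prefill_tok_per_s: float) -> float:
    """Time to first token, assuming TTFT is dominated by prefill."""
    return context_tokens / prefill_tok_per_s

def decode_seconds(output_tokens: int, decode_tok_per_s: float) -> float:
    return output_tokens / decode_tok_per_s

# Hypothetical setup: 30x more context than output.
context_tokens = 30_000   # retrieved emails, screenshots, etc.
output_tokens = 1_000     # the autocomplete itself

# Even if prefill throughput is 10x decode throughput...
prefill = ttft_seconds(context_tokens, prefill_tok_per_s=5_000)
decode = decode_seconds(output_tokens, decode_tok_per_s=500)

print(f"TTFT:   {prefill:.1f}s")   # 6.0s staring at a spinner
print(f"decode: {decode:.1f}s")    # 2.0s of visible streaming
```

The 20x-30x context-to-output ratio means the parallel phase still loses on wall-clock time.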
On top of that, the actual sampling speed isn’t really a big interaction bottleneck. Users can interrupt sampling early and work from a partial autocomplete; but we can’t partially remove parts of the context. And the completion animation looks cool!
Speculative Prefilling
Here’s the idea: if we have a reasonable idea of what the user will do next, then we can speculate and kick off the prefill stage with candidate contexts! Then when a user presses tab, we can start the decode phase. This is the same intuition behind speculative decoding (pre-generate tokens that might get used, and accept or reject them later) except here we’re speculating over the prefill rather than the decode. That’s it! Here’s a comparison video:
The time to first token here is effectively nothing.
Under the hood, we use ideas from two of my papers (see GUM and NAP for how to get access to a user model). With a user model pθ, we can predict what a user will do next given some context. The rough algorithm follows:
- On context update (e.g. you open an app), sample the top-k likely next actions from pθ and kick off a prefill of pθ with retrieved context corresponding to each predicted action, adding each to a candidate set C with the current timestamp.
- Continuously expire entries in C whose timestamp exceeds a TTL threshold, evicting their KV cache entries.
- On autocomplete trigger, check if the current context (via an embedding match on screenshot) matches any live candidate in C.
- If a match exists, decode immediately (TTFT ≈ 0).
- Otherwise, fall back to slow prefill, then decode.
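The steps above can be sketched as follows. Here `user_model`, `llm`, `retrieve`, `embed`, and `cosine` are assumed interfaces standing in for the real components, not actual APIs:

```python
# Minimal sketch of the speculative-prefill candidate cache.
# `user_model`, `llm`, `retrieve`, `embed`, and `cosine` are assumed
# interfaces, not real APIs. Thresholds are illustrative.
import time

TTL_SECONDS = 60.0
MATCH_THRESHOLD = 0.9

candidates = {}  # action -> {"kv_cache", "embedding", "ts"}

def on_context_update(context, user_model, llm, retrieve, embed, k=3):
    """Speculatively prefill for the top-k predicted next actions."""
    for action in user_model.top_k_actions(context, k=k):
        retrieved = retrieve(action)       # emails, screenshots, etc.
        kv_cache = llm.prefill(retrieved)  # the expensive step, done early
        candidates[action] = {
            "kv_cache": kv_cache,
            "embedding": embed(retrieved),
            "ts": time.monotonic(),
        }

def expire():
    """Evict candidates (and their KV caches) past the TTL."""
    now = time.monotonic()
    for action in [a for a, c in candidates.items()
                   if now - c["ts"] > TTL_SECONDS]:
        del candidates[action]

def on_autocomplete(screenshot, llm, embed, cosine):
    """Decode from a matching prefilled cache, else fall back to slow prefill."""
    expire()
    query = embed(screenshot)
    for c in candidates.values():
        if cosine(query, c["embedding"]) > MATCH_THRESHOLD:
            return llm.decode(c["kv_cache"])    # hit: TTFT ≈ 0
    return llm.decode(llm.prefill(screenshot))  # miss: slow prefill, then decode
```

The TTL and match threshold are knobs worth tuning: too loose and you decode from stale context, too tight and you rarely hit the cache.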
Of course, this depends on whether or not we have a context match. There are only so many things a user could autocomplete on their screen at a time, so I expect a reasonable payoff—but worth exploring more!
Systems and User Models
A small aside—I think we can view some of these user models as general-purpose human branch predictors. I’ve applied this idea here to LLMs, but you could speculate on any kind of application that might benefit from speculative execution with a user model!
Anyway, if you found this interesting, please consider citing:
NAP (Next Action Prediction):
@misc{shaikh2026learningactionpredictorshumancomputer,
title={Learning Next Action Predictors from Human-Computer Interaction},
author={Omar Shaikh and Valentin Teutschbein and Kanishk Gandhi and Yikun Chi and Nick Haber and Thomas Robinson and Nilam Ram and Byron Reeves and Sherry Yang and Michael S. Bernstein and Diyi Yang},
year={2026},
eprint={2603.05923},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.05923},
}
GUM (General User Models):
@misc{shaikh2025creatinggeneralusermodels,
title={Creating General User Models from Computer Use},
author={Omar Shaikh and Shardul Sapkota and Shan Rizvi and Eric Horvitz and Joon Sung Park and Diyi Yang and Michael S. Bernstein},
year={2025},
eprint={2505.10831},
archivePrefix={arXiv},
primaryClass={cs.HC},
url={https://arxiv.org/abs/2505.10831},
}