The streaming cadence
Why your token rate is a design decision, not a backend one.
Streaming was supposed to solve the feel of AI products. For a while, it did. Tokens arriving one by one felt alive. Then everyone shipped streaming and the effect wore off. What's left is the uncomfortable fact that streaming has become, for most products, just the default texture of latency.
It doesn't have to be. Streaming can still feel like thought, if you treat the cadence as a design surface. The rate isn't a backend decision. It's the pace of the performance.
"A good answer arrives in the order a thoughtful colleague would say it out loud. Not faster. Not slower."
Structure first, then prose.
The single highest-leverage move in streaming is to stream the structure of the answer before any of the body. One sentence that names the shape. "Three things matter here." "Two paths, one recommendation." "I'm going to walk through this in order, oldest first." Then the body.
This gives the user a mental container to pour the rest of the stream into. They stop reading word-by-word and start reading slot-by-slot. The perceived latency of the whole response collapses, because the user already knows the answer's shape by the time the body arrives.
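A structure-first stream is easy to sketch. The shapes and names below are illustrative, not from any particular SDK: the generator yields a one-sentence shape promise as its first chunk, then the body.

```typescript
// Hypothetical answer shape for illustration.
interface Answer {
  sections: string[];
}

// Yield the shape of the answer before any body text, so the reader
// has a mental container before the prose starts to arrive.
function* streamStructureFirst(answer: Answer): Generator<string> {
  const n = answer.sections.length;
  // The promise names the shape, e.g. "3 things matter here."
  yield `${n} things matter here.`;
  for (const section of answer.sections) {
    yield section;
  }
}

const chunks = Array.from(
  streamStructureFirst({ sections: ["First...", "Second...", "Third..."] })
);
// chunks[0] is the shape promise; the body follows in order.
```

The point is only the ordering: the first chunk commits to a shape, and every later chunk fills a slot in it.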
The difference is easiest to see side by side: human mode pauses at structural breaks, while token-rate mode streams at a fixed interval.
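The two modes reduce to two delay schedules. A minimal sketch, computing per-token delays a renderer could apply; the 15ms interval and 280ms pause are illustrative assumptions, not values from any spec:

```typescript
// A token plus a flag marking whether it opens a structural section.
type Token = { text: string; startsSection: boolean };

// Token-rate mode: every token waits the same fixed interval.
function tokenRateDelays(tokens: Token[], intervalMs = 15): number[] {
  return tokens.map(() => intervalMs);
}

// Human mode: near-zero delay inside a section, a clean pause
// at each structural break.
function humanDelays(tokens: Token[], pauseMs = 280): number[] {
  return tokens.map((t) => (t.startsSection ? pauseMs : 5));
}
```

Same tokens, same backend; only the placement of the waiting changes.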
The cadence I'd ship.
- Pause at structure, not at tokens. A clean pause of 200 to 320ms between sections reads like thought. A uniform token rate reads like a machine.
- Open with a promise. The first sentence of the stream should make a shape-promise the rest of the answer will fulfill. If you can't pre-commit to a shape, you don't have an answer yet — show a thinking indicator instead of starting to type.
- End softly. Do not animate a cursor past the last word. Fade it. A hard click at the end of a stream is a UX tell — it says "done" in a way that invites regeneration when regeneration is not wanted.
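The three rules above can be encoded as a render plan the UI consumes. Everything here, the event names and the 260ms pause, is an illustrative assumption, not a standard API:

```typescript
// Render events a streaming UI could consume.
type RenderEvent =
  | { kind: "thinking" }                // no shape yet: indicator, not typing
  | { kind: "text"; text: string }
  | { kind: "pause"; ms: number }       // structural pause in the 200-320ms band
  | { kind: "fadeCursor" };             // soft end: fade, never a hard click

function planStream(shapePromise: string | null, sections: string[]): RenderEvent[] {
  // Rule 2: if you can't pre-commit to a shape, don't start typing.
  if (!shapePromise) return [{ kind: "thinking" }];
  const events: RenderEvent[] = [{ kind: "text", text: shapePromise }];
  for (const s of sections) {
    // Rule 1: pause at structure, not at tokens.
    events.push({ kind: "pause", ms: 260 }, { kind: "text", text: s });
  }
  // Rule 3: end softly with a fading cursor.
  events.push({ kind: "fadeCursor" });
  return events;
}
```

Separating the plan from the rendering also makes the cadence testable, which a raw token loop never is.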
Cadence is trust.
There's a reason the three-rule cadence feels honest. It mirrors the way humans actually think when they're answering a hard question out loud. They commit to a shape early ("three things matter"), pause at structural joints ("first, second, third"), and taper at the end rather than stop dead.
A product that streams this way inherits the credibility of careful speech. A product that streams at a uniform token rate inherits the credibility of a loading bar. The backend can be identical. The product is not.
What this pattern gets wrong when it gets wrong.
- Latency lie: the interface pretends speed the backend doesn't have, like spinners that bounce faster than the real throughput.
- Silent truncate: the response ran out of room or tokens and the product didn't tell the user where it stopped.
- Confidence theater: language or typography that performs certainty beyond what the model actually has.
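The silent truncate has a cheap fix: surface the stop reason at the point where the text ends. A sketch; the `finishReason` field is an assumption modeled on common completion APIs ("stop" versus "length"), and the exact field name varies by provider.

```typescript
// Hypothetical result shape; most completion APIs expose something similar.
interface StreamResult {
  text: string;
  finishReason: "stop" | "length";
}

// If the model ran out of room, say so where it stopped rather than
// letting the response trail off as if it were complete.
function withTruncationNotice(result: StreamResult): string {
  if (result.finishReason === "length") {
    return result.text + "\n\n[Response truncated here: ran out of room.]";
  }
  return result.text;
}
```

One line of honesty at the cut point beats a paragraph that silently isn't there.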
Three shipping variants worth copying.
- The list-first answer (structure streams before prose)
- A fading cursor that signals 'still thinking' versus 'still writing'
- A soft end-of-message rule, never a hard click