
Read-aloud

Text-to-speech that you can follow with your eyes.


Text-to-speech is mostly a compliance feature everyone builds: a play button in the corner, a robotic voice, a dropdown nobody opens. The products that actually get used let the user watch the words move with the voice, jump to any sentence, and control the pace without hunting for a scrubber.

The pattern that works is a duet, not a swap. The text and the voice move together. The user can lead or follow. Either is fine. Both is better.

"A voice without a visible transcript is a voice the user can't trust."
The pattern

One active sentence, always.

At any moment during playback, exactly one sentence is the active sentence. It's highlighted with a quiet, non-distracting tint. A small left-border accent runs along it. The cursor of attention is the sentence, not the word — words pass too quickly for the eye to track meaningfully, and anything finer than a sentence fights with reading comprehension.
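Making "the sentence is the cursor" concrete starts with sentence spans. A minimal sketch, assuming a naive end-of-sentence punctuation rule (the `SentenceSpan` shape and function name are illustrative; production code would lean on a real segmenter such as `Intl.Segmenter` with sentence granularity):

```typescript
// One span per sentence: the unit the highlight moves over.
interface SentenceSpan {
  index: number;
  text: string;
  start: number; // character offset into the transcript
  end: number;
}

function segmentSentences(transcript: string): SentenceSpan[] {
  // Naive boundary rule: a run of text ending in ., !, or ?.
  // A shipping implementation would use Intl.Segmenter({ granularity: "sentence" }).
  const re = /[^.!?]+[.!?]+|[^.!?]+$/g;
  const spans: SentenceSpan[] = [];
  let m: RegExpExecArray | null;
  let index = 0;
  while ((m = re.exec(transcript)) !== null) {
    const text = m[0].trim();
    if (!text) continue;
    spans.push({ index: index++, text, start: m.index, end: m.index + m[0].length });
  }
  return spans;
}
```

With spans in hand, "exactly one active sentence" is just one index into this array; the tint and border accent hang off that index.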

Click any sentence and playback restarts from there. Pause and the highlight stays, so the user can glance away and come back. Resume and the voice picks up at the start of the active sentence, not mid-word.
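The click-to-restart and resume-at-sentence-start behaviors can be sketched as a small cursor over timed sentences. This assumes each sentence has a known audio start time (e.g. from TTS timestamps); the `seek` callback and class name are stand-ins, not a real API:

```typescript
interface TimedSentence { index: number; audioStart: number } // seconds

class ReadAloudCursor {
  private active = 0;
  constructor(
    private sentences: TimedSentence[],
    private seek: (seconds: number) => void, // stand-in for the audio backend
  ) {}

  // Click any sentence: playback restarts from its beginning.
  jumpTo(index: number): void {
    this.active = index;
    this.seek(this.sentences[index].audioStart);
  }

  // Resume picks up at the start of the active sentence, never mid-word.
  resume(): void {
    this.seek(this.sentences[this.active].audioStart);
  }

  // Advance the active sentence as the audio clock crosses boundaries.
  onTime(seconds: number): void {
    let i = this.active;
    while (i + 1 < this.sentences.length && seconds >= this.sentences[i + 1].audioStart) i++;
    this.active = i;
  }

  get activeIndex(): number { return this.active; }
}
```

Pause needs no method at all: the cursor simply stays where it is, which is exactly the behavior the pattern asks for.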

[Interactive demo: click any sentence to restart from there · pace snaps to four discrete rates · voice labeled synthetic · transcript is the source of truth]

The why

Modality is a contract.

When a product reads its output aloud, the user's trust is running on two surfaces at once: the words on screen and the words in their ears. If those two surfaces drift — because the voice is faster, because the voice skips a sentence the text shows — the user loses track of which one is authoritative. The visible transcript has to be the source of truth. Always.

This is also a reason to clearly label the voice as synthetic. Users can tell, but naming it removes the small doubt, especially on mobile.

Three moves

The read-aloud I'd ship.

  • Four pace presets, not a slider. 0.9x, 1x, 1.25x, 1.5x. A continuous slider asks users to calibrate. Four presets give them a decision.
  • Sentence is the atom. Do not animate word-by-word. It looks impressive for five seconds and becomes exhausting to read under. The sentence is the unit of human listening comprehension.
  • Pause leaves the cursor in place. Coming back to a paused track should start at the top of the current sentence, not where the audio was cut. Nobody wants to hear "— and that's the third point" as their resume.
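The presets-not-a-slider move is one function. A minimal sketch, assuming the four rates named above (`snapPace` is an illustrative name):

```typescript
// The four discrete rates from the text; no continuous slider.
const PACE_PRESETS = [0.9, 1, 1.25, 1.5];

// Snap any requested rate to the nearest preset.
function snapPace(requested: number): number {
  return PACE_PRESETS.reduce((best, p) =>
    Math.abs(p - requested) < Math.abs(best - requested) ? p : best,
  );
}
```

Whatever the input surface is (a dial, keyboard shortcuts, a gesture), every value funnels through the same four choices.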

The trap

Voice pulling ahead.

The most common implementation bug is that the voice pulls ahead of the highlight by a word or two — latency between the audio thread and the render thread. Users register it as confusion, not as a bug. Always tune the sync so the highlight leads by a hair. A user who's looking slightly ahead of the voice feels in command. A user who's looking slightly behind feels behind.
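One way to guarantee the highlight leads is to bias the clock before looking up the sentence, rather than trying to shave latency out of the audio path. A sketch under that assumption; the 120 ms lead is an illustrative value, not a measured spec:

```typescript
// Assumed lead: the highlight reads this far ahead of the audio clock.
const HIGHLIGHT_LEAD_SECONDS = 0.12;

// Given the raw audio time and each sentence's start time (seconds, ascending),
// return the sentence the highlight should sit on.
function highlightedIndex(audioTime: number, starts: number[]): number {
  const biased = audioTime + HIGHLIGHT_LEAD_SECONDS;
  let i = 0;
  while (i + 1 < starts.length && biased >= starts[i + 1]) i++;
  return i;
}
```

Because the bias is applied at lookup, the highlight flips to the next sentence a beat before the voice gets there, so the eye is never trailing the ear.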

Failure modes

What this pattern gets wrong when it gets wrong.

Modality mismatch
The product answers in one modality when another was implied, or mixes modalities in a way the user can't combine.
Ambiguous state
Running, done, errored, paused all look the same. The user has to infer from context.
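One guard against ambiguous state is to make the states unrepresentable as each other. A sketch using a discriminated union (the state names mirror the list above; the label function is illustrative):

```typescript
// Each state carries only the data that state can have,
// so the UI cannot render two states identically by accident.
type PlaybackState =
  | { kind: "idle" }
  | { kind: "playing"; sentenceIndex: number }
  | { kind: "paused"; sentenceIndex: number }
  | { kind: "done" }
  | { kind: "errored"; message: string };

function statusLabel(s: PlaybackState): string {
  switch (s.kind) {
    case "idle": return "Ready";
    case "playing": return `Reading sentence ${s.sentenceIndex + 1}`;
    case "paused": return `Paused at sentence ${s.sentenceIndex + 1}`;
    case "done": return "Finished";
    case "errored": return `Error: ${s.message}`;
  }
}
```

The compiler then forces every render path to say which state it is drawing; "infer from context" stops being an option.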
Seen in the wild

Three shipping variants worth copying.

  • A moving highlight that tracks the currently-spoken sentence
  • A click-any-sentence-to-restart behavior that feels like skipping to a DVD chapter
  • A pace dial that snaps to 0.9x, 1x, 1.25x, 1.5x — not a continuous slider