The voice composer
Dictating a prompt without losing the option to edit.
Voice is the fastest way to load context into an AI product. It's also the easiest way to produce a transcript the user can't trust. The middle path is a composer that transcribes live into editable text, with a visible waveform and low-confidence words flagged so the user can fix them before sending.
The pattern isn't dictation. Dictation assumes the transcript is the product. The composer assumes the transcript is the input, and the user gets to fix it before it becomes an instruction.
"Voice is input. Text is intent. The composer is the bridge. Never confuse the three."
Live transcript. Waveform. Edit-on-click.
A waveform confirms the recording is working. The transcript fills in word by word. High-confidence words are set in regular weight; low-confidence words carry a dotted red underline and are click-to-edit. A toggle lets the user send as audio (the raw clip plus the transcript) for moments when text can't hold the intention.
Tap record. Speak. Watch the live transcript fill in. Low-confidence words get a dotted underline.
Speak, skim, edit. Low-confidence words are clickable. The clip is optional, not automatic.
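The flagging step can be sketched in a few lines. This is a minimal sketch, assuming an ASR engine that exposes a per-word confidence score between 0 and 1; real engines expose this differently, and the 0.85 threshold is illustrative, not a recommendation.

```typescript
// A transcribed word as delivered by a streaming ASR engine.
// The shape is an assumption; real engines report confidence differently.
interface TranscriptWord {
  text: string;
  confidence: number; // 0..1
}

interface RenderedWord extends TranscriptWord {
  flagged: boolean; // true → dotted underline, click-to-edit
}

// Words below the threshold get flagged for in-place editing.
// Everything else renders in regular weight.
function flagLowConfidence(
  words: TranscriptWord[],
  threshold = 0.85
): RenderedWord[] {
  return words.map((w) => ({ ...w, flagged: w.confidence < threshold }));
}
```

The rendering layer only has to branch on `flagged`; the threshold stays in one place, where it can be tuned per engine.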
The mouth is faster than the keyboard, but wronger.
Most users can speak three times faster than they type. They also produce more errors per unit time. A voice composer that hides errors makes speech useless, because the user can't trust what gets sent. A composer that surfaces low-confidence words makes speech genuinely faster: talk, glance, fix, send.
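The "fix" step in that loop is an in-place replacement, not a modal. A minimal sketch, with hypothetical names: the user clicks a flagged word, types a correction, and the flag clears because a human-confirmed word needs no further review.

```typescript
interface Word {
  text: string;
  flagged: boolean; // low ASR confidence, awaiting review
}

// Replace one word's text and clear its flag; every other word
// is untouched. Returns a new array rather than mutating state.
function applyCorrection(words: Word[], index: number, fix: string): Word[] {
  return words.map((w, i) =>
    i === index ? { text: fix, flagged: false } : w
  );
}
```

Returning a fresh array keeps the transcript usable as immutable UI state, so each correction is a single, undoable transition.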
Voice that respects the text.
- Waveform, not microphone icon. The waveform is proof of life. The icon is decoration.
- Flag low-confidence words. Click-to-edit. No modal. The user fixes in place.
- Offer 'send as audio.' Sometimes the tone matters. The model should be able to receive the clip, not just the transcript.
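The 'send as audio' toggle changes the payload, not the flow. A sketch of the outgoing message, under the assumption that the clip travels as an encoded attachment alongside the transcript; the field names here are illustrative, not any particular API.

```typescript
interface AudioClip {
  mimeType: string; // e.g. "audio/webm"
  base64: string;   // encoded recording
}

interface VoiceMessage {
  transcript: string;  // always present: the edited text is the intent
  audioClip?: AudioClip; // present only when the user opts in
}

// The transcript is always sent; the raw clip rides along only
// when the user toggled "send as audio" and a recording exists.
function buildMessage(
  transcript: string,
  clip: AudioClip | null,
  sendAsAudio: boolean
): VoiceMessage {
  const msg: VoiceMessage = { transcript };
  if (sendAsAudio && clip) msg.audioClip = clip;
  return msg;
}
```

Keeping the transcript mandatory means the model always receives the user-corrected text, and the clip is supplementary tone, never the sole record.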
Voice UIs that pretend the transcript is perfect.
The most common failure mode is a voice interface that renders the transcript in one uniform weight, with no signal about which words the ASR was unsure about. The user skims, misses the misheard word, sends, and the model answers the wrong question.
What this pattern gets wrong when it gets wrong.
- Modality mismatch
- The product answers in one modality when another was implied, or mixes modalities in a way the user can't combine.
- Consent skip
- Capturing, transmitting, or acting on input the user didn't agree to share in this moment.
- Confidence theater
- Language or typography that performs certainty beyond what the model actually has.
Three shipping variants worth copying.
- A composer that transcribes live as the user speaks
- Low-confidence words get a dotted underline, editable
- A 'send as audio' option preserves the raw clip