
Image annotation

Pointing at a screenshot without leaving the chat.

9 min

Images are inputs. The user uploads a screenshot, a photo, or a diagram, and then describes what they mean. The description is usually longer and less precise than a pointing gesture. An annotation layer makes the gesture first-class: the user circles something, the model answers about the circled thing, and the prompt builds itself.

The pattern makes deixis explicit. 'Reason about this thing I pointed at' replaces 'the third button from the left in the bottom row.' The interaction moves from typing to tapping; the precision goes up; the frustration goes down.

"A circle is worth a thousand words of prompt. When the model can see the circle, the user stops writing essays."

The pattern

Three tools. Label on placement. Focus-only toggle.

Three basic tools: circle, box, arrow. Dragging on the image places the mark, and a label appears next to it. The prompt builds itself out of the annotations: 'reason about {{box 1}}, {{circle 2}}'. A focus-only toggle narrows the model's attention to just the annotated regions; the rest of the image fades but stays visible.
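A minimal sketch of how the prompt might build itself from placed marks. Everything named here (`Annotation`, `AnnotationLayer`, `place`, `build_prompt`, the image-relative `bounds` tuple) is hypothetical and not taken from any shipping product. It also assumes labels are numbered by a single counter shared across tools, which is what the 'box 1', 'circle 2' example suggests.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Annotation:
    tool: str                                   # "circle", "box", or "arrow"
    bounds: Tuple[float, float, float, float]   # x, y, width, height as fractions of the image (assumed)
    label: str                                  # default label, doubles as the prompt slot


class AnnotationLayer:
    TOOLS = ("circle", "box", "arrow")

    def __init__(self) -> None:
        self.annotations: List[Annotation] = []
        self.focus_only = False   # the focus-only toggle, off by default
        self._counter = 0         # one counter shared by all tools (assumption)

    def place(self, tool: str, bounds: Tuple[float, float, float, float]) -> Annotation:
        """Place a mark and give it a default label immediately."""
        if tool not in self.TOOLS:
            raise ValueError(f"unknown tool: {tool!r}")
        self._counter += 1
        annotation = Annotation(tool, bounds, f"{tool} {self._counter}")
        self.annotations.append(annotation)
        return annotation

    def build_prompt(self, verb: str = "reason about") -> str:
        """Turn the current marks into a prompt with one {{slot}} per mark."""
        if not self.annotations:
            return ""
        slots = ", ".join("{{" + a.label + "}}" for a in self.annotations)
        return f"{verb} {slots}"


layer = AnnotationLayer()
layer.place("box", (0.10, 0.20, 0.30, 0.10))
layer.place("circle", (0.55, 0.40, 0.15, 0.15))
print(layer.build_prompt())
```

The label is assigned at placement time rather than at send time, so the user sees the slot name before they decide what to ask.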

Interactive demo: Drawing on a mock UI. Circle, box, arrow: build the prompt by pointing. The mock has four regions (Top nav, Chart, Metrics, Table). Draw on the image. The prompt builds itself out of what you circled. Focus-only narrows the model's attention.

The why

Pointing is a protocol humans already know.

Users have been pointing at things their whole lives. Text descriptions are an unnatural fallback. When the interface makes pointing possible, users identify what they mean faster and more accurately. The model responds to the actual thing, not to a verbal approximation of it.

Three moves

Annotations that carry their weight.

  • Label on placement. Every mark should get a default label the user can rename. 'Circle 1' is a prompt slot, not a decoration.
  • Focus-only toggle. Some questions should ignore the rest of the image. A toggle gives the user control.
  • Undo and remove. Annotations are cheap to make and cheap to revert. Keep them that way.
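One way to keep marks cheap to revert is to record an inverse action for every change. The sketch below assumes a simple last-in, first-out undo stack; `Mark`, `AnnotationHistory`, and their methods are illustrative names, not an existing API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Mark:
    id: int
    tool: str
    label: str


class AnnotationHistory:
    def __init__(self) -> None:
        self.marks: List[Mark] = []
        self._undo_stack: List[Callable[[], None]] = []
        self._next_id = 1

    def _index_of(self, mark_id: int) -> int:
        for index, mark in enumerate(self.marks):
            if mark.id == mark_id:
                return index
        raise KeyError(mark_id)

    def place(self, tool: str) -> Mark:
        """Add a mark with a default label; undo removes it again."""
        mark = Mark(self._next_id, tool, f"{tool} {self._next_id}")
        self._next_id += 1
        self.marks.append(mark)
        self._undo_stack.append(lambda: self.marks.remove(mark))
        return mark

    def rename(self, mark_id: int, label: str) -> None:
        """Replace the default label; undo restores the previous one."""
        mark = self.marks[self._index_of(mark_id)]
        previous = mark.label
        mark.label = label
        self._undo_stack.append(lambda: setattr(mark, "label", previous))

    def remove(self, mark_id: int) -> None:
        """Delete a mark; undo puts it back at its original position."""
        index = self._index_of(mark_id)
        mark = self.marks.pop(index)
        self._undo_stack.append(lambda: self.marks.insert(index, mark))

    def undo(self) -> bool:
        """Revert the most recent change. Returns False if nothing is left."""
        if not self._undo_stack:
            return False
        self._undo_stack.pop()()
        return True
```

Because removal is itself undoable, deleting a mark by accident costs the user one keystroke, not a redraw.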

The trap

Annotations the model ignores.

The worst failure mode is an annotation layer that the model doesn't actually condition on. The user circles something, the model answers about the whole image, the circle turns out to have been decorative. The user learns the annotation was theater.
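A guard against decorative annotations could sit at the point where the request is assembled. The sketch below is an assumption-heavy illustration: the payload shape, the `build_request` function, and the `regions` mapping are all hypothetical and do not describe any particular model API. It refuses to send a prompt whose slots have no matching geometry, so a circle named in the prompt always travels with its coordinates.

```python
import re
from typing import Dict, List

# Matches prompt slots such as {{box 1}} and captures the label inside.
SLOT = re.compile(r"\{\{(.+?)\}\}")


def build_request(prompt: str, regions: Dict[str, dict], focus_only: bool) -> dict:
    """Assemble a request in which every prompt slot is backed by a region.

    `regions` maps an annotation label to its geometry, for example
    {"box 1": {"tool": "box", "bounds": [0.1, 0.2, 0.3, 0.1]}}.
    """
    referenced: List[str] = SLOT.findall(prompt)
    missing = [label for label in referenced if label not in regions]
    if missing:
        # A slot without geometry is exactly the decorative case: fail loudly.
        raise ValueError(f"prompt references unknown annotations: {missing}")
    return {
        "prompt": prompt,
        "focus_only": focus_only,
        "regions": [{"label": label, **regions[label]} for label in referenced],
    }
```

This only guarantees that the geometry reaches the model. Whether the model actually conditions on it still has to be checked against its answers.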

Failure modes

What this pattern gets wrong when it gets wrong.

  • Modality mismatch. The product answers in one modality when another was implied, or mixes modalities in a way the user can't combine.
  • Phantom tool. A visible tool call that didn't happen, or happened but with different arguments than shown.
  • Unverified claim. A figure or fact presented without provenance, in a place where the reader will treat it as cited.

Seen in the wild

Three shipping variants worth copying.

  • A circle / arrow / box tool over any uploaded image
  • Annotations become part of the prompt, labeled
  • A 'focus on annotated only' toggle for the model