Latency trace
A waterfall of where the seconds went.
Slow AI products feel slow for specific, diagnosable reasons. The problem is that they feel slow as a single experience, so the team argues about which part is slow based on vibes. The latency trace ends the argument by putting every second on a waterfall: retrieval took this long, the model took that long, the tool call took this long, the render took that long.
Once a team has a trace open alongside the product, the discussion changes. It stops being "it feels slow" and starts being "retrieval is 3 seconds, let's halve it." Traces make performance actionable.
"Perceived latency lies. Traces don't."
A waterfall of named spans.
The trace is a horizontal timeline. Each span is a labeled bar: retrieval, system-prompt assembly, first-token, streaming, tool-call, tool-return, render. Spans nest so a tool call can contain its own internal spans. A vertical cursor shows "first visible token" and "final token," the two moments users actually care about.
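The span structure above can be sketched with a minimal, stdlib-only tracer. `Tracer`, `span`, and `waterfall` are illustrative names, not an existing API; the point is that named, nestable spans are a few dozen lines, not an observability platform.

```python
import time
from contextlib import contextmanager

class Tracer:
    """Records named spans as (depth, name, start, end) tuples."""

    def __init__(self):
        self.spans = []
        self._depth = 0

    @contextmanager
    def span(self, name):
        start = time.monotonic()
        self._depth += 1
        try:
            yield
        finally:
            self._depth -= 1
            self.spans.append((self._depth, name, start, time.monotonic()))

    def waterfall(self):
        # Render each span as an indented label: offset from trace start
        # plus its own duration, both in milliseconds.
        t0 = min(s[2] for s in self.spans)
        lines = []
        for depth, name, start, end in sorted(self.spans, key=lambda s: s[2]):
            lines.append(f"{'  ' * depth}{name}: "
                         f"+{(start - t0) * 1000:.0f}ms "
                         f"({(end - start) * 1000:.0f}ms)")
        return "\n".join(lines)

tracer = Tracer()
with tracer.span("tool-call"):
    with tracer.span("third-party-api"):
        time.sleep(0.01)
print(tracer.waterfall())
```

Because spans nest via the context manager, a tool call's internal spans indent under it for free, which is exactly the structure the waterfall view needs.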
Traces should be cheap to capture. If producing a trace requires a special build flag, developers won't capture one until after the complaint has been in the backlog for a month. Ship the trace with the product, hidden behind a small debug affordance.
Time has a shape.
Every AI product has a time budget and a user tolerance. A trace is how you find out where the budget went. Most teams are surprised by the answer the first time they look. Retrieval is longer than anyone guessed. The model is faster than anyone feared. The tool call is what's killing the experience, and nobody knew.
Traces also reveal the difference between wall-clock latency and perceived latency. A 10-second operation with a streaming first token at 600ms feels fast. A 3-second operation with no output until the end feels slow. Knowing the shape lets you trade correctly.
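The two shapes can be made concrete with a tiny helper. The event log format here is hypothetical, just timestamped `(ms, event)` pairs, but it shows how first-visible-token and wall-clock latency are separate numbers from the same trace.

```python
def latency_shape(events):
    """Given (timestamp_ms, event) pairs, return the two latencies
    users actually feel: time to first token and total wall clock."""
    t0 = events[0][0]
    first_token = next(t for t, e in events if e == "token")
    final = events[-1][0]
    return {"first_visible_ms": first_token - t0,
            "wall_clock_ms": final - t0}

# The 10-second operation that feels fast: first token at 600ms.
events = [(0, "request"), (600, "token"), (10_000, "done")]
print(latency_shape(events))
```

Run against the 3-second operation with no streaming, `first_visible_ms` equals `wall_clock_ms`, which is the trace-level signature of a product that feels slow.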
The trace I'd ship.
- First-visible-token is a first-class metric. It's the number that maps most tightly to how users feel the product. Surface it on the trace and in the product's internal dashboards.
- Sub-span tool calls. When a tool call takes 4 seconds, the team needs to know if that's the retry, the third-party API, or the parsing. Sub-spans inside tool calls are where most diagnosis happens.
- Compare two traces. "This request today vs the median request last week" is the question that matters when a user complains. The product should let them diff traces, not just view one.
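A sketch of the compare-traces affordance, assuming each trace has been flattened to a dict of span name to duration in milliseconds (the names and the `diff_traces` helper are illustrative):

```python
def diff_traces(current, baseline):
    """Compare a trace against a baseline, biggest deltas first."""
    rows = []
    for name in current:
        base = baseline.get(name, 0)
        rows.append((name, current[name], base, current[name] - base))
    return sorted(rows, key=lambda r: -abs(r[3]))

today = {"retrieval": 3000, "first-token": 650, "tool-call": 900}
median_last_week = {"retrieval": 1200, "first-token": 600, "tool-call": 950}
for name, cur, base, delta in diff_traces(today, median_last_week):
    print(f"{name}: {cur}ms vs {base}ms ({delta:+d}ms)")
```

Sorting by absolute delta puts the answer to the user's complaint on the first line: retrieval regressed by 1.8 seconds, everything else is noise.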
Median is a lie.
The subtle failure of latency work is optimizing the median while the p95 quietly gets worse. A team that celebrates a 200ms median improvement can ship a change that adds a 5-second tail on 3% of requests. The tail is where users give up. Every trace view should have a p95 and p99 counter next to the median.
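The failure mode above is easy to simulate. This sketch fabricates a synthetic baseline and an "improved" build that shaves 200ms off the median while adding a 5-second tail to 3% of requests; the tail shows up in p99, not the median.

```python
import random

def pct(samples, q):
    """Naive percentile: value at quantile q of the sorted samples."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(q * len(s)))]

random.seed(0)
before = [1000 + random.gauss(0, 50) for _ in range(10_000)]
# "Improved" build: 200ms faster everywhere, but 3% of requests
# now grow a 5-second tail.
after = [t - 200 + (5000 if random.random() < 0.03 else 0) for t in before]

for name, samples in (("before", before), ("after", after)):
    print(f"{name}: p50={pct(samples, 0.50):.0f}ms "
          f"p95={pct(samples, 0.95):.0f}ms "
          f"p99={pct(samples, 0.99):.0f}ms")
```

A dashboard showing only p50 would celebrate this change; the p99 counter next to it is what catches the 3% of users who just waited five extra seconds.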
Tails eat trust. Watch the tails.
What this pattern gets wrong when it gets wrong.
- Latency lie: the interface pretends speed the backend doesn't have. Spinners that bounce faster than the real throughput.
- Throttle silence: a rate limit, queue, or budget cap that silently slows or stops the product without telling the user why.
- Phantom tool: a visible tool call that didn't happen, or happened with different arguments than shown.
Three shipping variants worth copying.
- A per-turn waterfall below the composer in debug mode
- A color-by-category rule (retrieval, model, tool, render)
- A compare-traces affordance that overlays two runs