The model fallback

Choosing a cheaper model without surprising the user.

10 min

Every serious AI product routes. Short prompts go to fast models. Long threads go to cheap models. Pro users pay for flagship models. This routing is everywhere, and in most products it's invisible. The user sees a response and has no idea which brain produced it.

The good pattern is disclosure. Every response carries a chip: which tier answered, how long it took, what it cost. With one tap, the user can redo the response with the bigger model and see the delta. The routing is still automatic; the identity is just no longer a secret.

"The model that answered is part of the answer. Hiding it doesn't make the product faster. It makes the product dishonest."

The pattern

Model chip. Latency. Cost. Redo.

The chip lives at the top of every response, muted, not loud. It names the model by tier and version. Next to it, two figures: how long this response took, and how much it cost. Below the response, a single affordance: redo with the next tier up. The user can always see the whole shape of the trade-off.
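As a rough sketch of the data behind the chip, the snippet below models the metadata a response might carry and renders it as a single muted line. The tier names `flash` and `mid` come from this article; `flagship` as a tier label, the version string `v1`, and the names `ResponseMeta`, `render_chip`, and `next_tier_up` are assumptions made for illustration, not a real API.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed tier ordering, cheapest first. Real products will differ.
TIERS = ("flash", "mid", "flagship")


@dataclass(frozen=True)
class ResponseMeta:
    tier: str         # which tier answered
    version: str      # model version shown next to the tier
    latency_s: float  # how long this response took, in seconds
    cost_usd: float   # what this single response cost


def render_chip(meta: ResponseMeta) -> str:
    """Format the chip text: tier and version, then latency and cost."""
    return (
        f"{meta.tier} {meta.version} · "
        f"latency {meta.latency_s:.1f}s · cost ${meta.cost_usd:.3f}"
    )


def next_tier_up(tier: str) -> Optional[str]:
    """Return the tier the redo affordance should offer, or None at the top."""
    i = TIERS.index(tier)
    return TIERS[i + 1] if i + 1 < len(TIERS) else None


meta = ResponseMeta(tier="flash", version="v1", latency_s=0.4, cost_usd=0.001)
print(render_chip(meta))        # flash v1 · latency 0.4s · cost $0.001
print(next_tier_up(meta.tier))  # mid
```

Returning `None` from `next_tier_up` at the top tier gives the interface a simple signal to hide the redo affordance when there is nothing bigger to offer.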

Three tiers, same question

Answered by: flash · fast (routed here due to thread length)

latency · 0.4s, cost · $0.001

The Q3 plan focuses on enterprise ACV. Three workstreams: pricing, packaging, and sales enablement. Risk: hiring delay.

Every response names the model that produced it. The cheap tiers are useful when they are honest about being cheap.

The why

Silent fallbacks erode every assumption.

When a product silently downgrades the model, the user's sense of product quality becomes a coin flip. They might get the flagship, they might get the fast model. Over enough turns, they stop trusting any answer, because they can't tell which model produced it or how much weight to give it. A visible chip restores their ability to evaluate each answer.

Three moves

Honest routing.

Chip, not banner. The model identity is metadata, not a warning.
Redo with delta. The redo affordance should show how much the answer would change, not just let the user guess.
Name fallbacks in the timeline. When the model tier changes mid-thread, the session timeline should carry a small dot so the user can see when and why.
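The third move can be sketched as a pass over the thread that emits one event wherever the tier changes between consecutive turns. Each event carries the when and the why that the timeline dot needs. The `Turn` and `TierChange` types and the `tier_changes` function are hypothetical names for this example, and the sketch assumes the router records a reason on each turn.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Turn:
    index: int                    # position of the turn in the thread
    tier: str                     # tier that actually answered this turn
    reason: Optional[str] = None  # why the router chose it, if recorded


@dataclass(frozen=True)
class TierChange:
    at_turn: int
    from_tier: str
    to_tier: str
    reason: str


def tier_changes(turns: List[Turn]) -> List[TierChange]:
    """Return one event per point where the tier differs from the previous turn."""
    events = []
    for prev, cur in zip(turns, turns[1:]):
        if cur.tier != prev.tier:
            events.append(
                TierChange(cur.index, prev.tier, cur.tier, cur.reason or "unspecified")
            )
    return events


thread = [
    Turn(1, "flagship"),
    Turn(2, "flagship"),
    Turn(3, "flash", reason="thread length"),
]
for event in tier_changes(thread):
    print(f"turn {event.at_turn}: {event.from_tier} -> {event.to_tier} ({event.reason})")
```

An event whose reason is "unspecified" is itself a useful signal: it marks a tier change that the routing could not explain.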

The trap

Routing that hides its own logic.

The failure mode is a chip that says 'flash' when the real call was 'mid', or a tier label that nobody at the company can define. The chip has to match the routing, and the routing has to be describable. Otherwise the disclosure becomes a new form of dishonesty.
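One way to keep the chip and the routing in step is to have the router return its reason along with the tier, and to build the chip only from that decision plus the tier that actually served the call. The sketch below assumes a toy rule set based on the routing described at the top of this article. The thresholds, the rule order, and the names `RouteDecision`, `route`, and `chip_label` are all invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
SHORT_PROMPT_TOKENS = 200
LONG_THREAD_TOKENS = 8000


@dataclass(frozen=True)
class RouteDecision:
    tier: str
    reason: str  # a short phrase a person at the company could explain


def route(prompt_tokens: int, thread_tokens: int, is_pro: bool) -> RouteDecision:
    """Pick a tier and record why. The rule order here is an assumption."""
    if is_pro:
        return RouteDecision("flagship", "pro plan")
    if thread_tokens > LONG_THREAD_TOKENS:
        return RouteDecision("flash", "thread length")
    if prompt_tokens < SHORT_PROMPT_TOKENS:
        return RouteDecision("flash", "short prompt")
    return RouteDecision("mid", "default routing")


def chip_label(decision: RouteDecision, served_tier: str) -> str:
    """Build the chip text, refusing to name a tier that did not serve the call."""
    if decision.tier != served_tier:
        raise ValueError(
            f"chip says {decision.tier!r} but the call used {served_tier!r}"
        )
    return f"{decision.tier} · routed here due to {decision.reason}"


decision = route(prompt_tokens=50, thread_tokens=20_000, is_pro=False)
print(chip_label(decision, served_tier="flash"))
# flash · routed here due to thread length
```

Because every branch returns a reason, a tier label with no definition cannot be produced, and a mismatch between the displayed and served tier fails loudly instead of shipping a wrong chip.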

Failure modes

What this pattern gets wrong when it gets wrong.

Confidence theater: Language or typography that performs certainty beyond what the model actually has.
Phantom tool: A visible tool call that didn't happen, or happened but with different arguments than shown.
Unverified claim: A figure or fact presented without provenance, in a place where the reader will treat it as cited.

Seen in the wild

Three shipping variants worth copying.

A model chip on each response showing tier + version
A 'redo with [bigger model]' button that shows delta
Fallback events get a muted dot in the session timeline
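For the redo button, "delta" needs some measurable definition. The sketch below uses one possible proxy: the share of word-level content that differs between the original and the redone answer, computed with Python's standard `difflib.SequenceMatcher`. This metric and the `answer_delta` name are assumptions for this example. Word overlap says nothing about whether the meaning changed, so a real product may want a different measure.

```python
import difflib


def answer_delta(original: str, redo: str) -> float:
    """Return a 0.0 to 1.0 score: 0.0 means identical words, 1.0 means no overlap."""
    a, b = original.split(), redo.split()
    # autojunk=False disables the popularity heuristic applied to long sequences.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return 1.0 - matcher.ratio()


before = "Three workstreams: pricing, packaging, and sales enablement."
after = "Three workstreams: pricing, packaging, and partner enablement."
print(f"{answer_delta(before, after):.0%} of the answer changed")
```

The same comparison can sit next to the latency and cost differences between the two responses, so the user sees what the bigger model bought them.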