Skip to content

Hedge calibration

Saying 'probably' in a way the reader can actually parse.

9 min

'Likely' means 70%. 'Plausibly' means 55%. 'Almost certainly' means above 95%. Every product has a hedge vocabulary; almost no product publishes it. The result is a language that looks calibrated but is, in practice, poetic. A user reading 'likely' has to guess whether it's a coin flip or near-certain.

The fix is small and publishable. Attach each hedge word to a visible band. A dot notation, three-to-four levels of fill, a one-line legend. The vocabulary becomes a notation. Users learn it in twenty seconds and carry it forever.

"Hedge words are a notation system. Publish the key. The rest is usable typography."
The pattern

Dot marks. Three to four bands. Pinned legend.

Four bands cover most hedge space: high (almost certainly), likely, possible, uncertain. Each hedge word in the answer carries a small dot cluster next to it, filled to match its band. A pinned legend maps each pattern to a probability range. Hover any hedge; the matching band lights up in the legend.

Calibrated hedges
Hover any hedge to see the band
Calibrated answer

Retention almost certainly improved in the new cohort, and likely by more than 2 points. The driver was most plausibly the onboarding redesign, though it's possible the pricing change contributed. It is unclear whether the effect persists past 30 days.

Key · hedges map to these bands
almost certainly
90–99%
likely
65–85%
possible / plausible
40–60%
unclear / uncertain
below 40%

Hover any hedge word; the matching band lights up in the key. A system every reader can learn in 20 seconds.

The why

Hedging is language. Language needs a key.

When 'likely' means different things in different sentences, the user stops parsing hedges. Eventually they ignore them, which defeats the whole point. A shared key turns hedges into measurable signals. Suddenly the model is calibrated, in a way the user can actually verify.

Three moves

Hedges that mean something.

  • Pick four bands, not seven. More bands sound precise; they're harder to remember. Four fits the working memory of every reader.
  • Use typography, not color. Dots scale better than background tints, especially in long answers.
  • Pin the key. The legend shouldn't live in a help doc. Show it under the response, once, so readers can learn it inline.

The trap

Hedges that don't match the math.

The worst failure mode is a calibration key that doesn't match the model's behavior. The model says 'likely' in sentences that are actually 95% certain and other sentences that are actually 60%. The key becomes a promise you can't keep. Users feel the mismatch over time, then distrust every hedge.

Failure modes

What this pattern gets wrong when it gets wrong.

Confidence theater
Language or typography that performs certainty beyond what the model actually has.
Empty disclaimer
A legal-feeling warning that carries no specific information about this particular response.
Unverified claim
A figure or fact presented without provenance, in a place where the reader will treat it as cited.
Seen in the wild

Three shipping variants worth copying.

  • A 3-dot confidence mark next to every hedge word
  • A pinned legend explaining what each mark means
  • Hedges that can't be calibrated are explicitly omitted