Prompt diff

Two prompts, side by side, same inputs, different outputs.

A prompt change is a deploy. Teams that don't treat it that way ship regressions and find out later. The prompt diff is the discipline that makes a change legible: two versions, one shared input, the same model, the outputs rendered side by side. You see what the words did, not what you hoped they would.

A good diff is an argument settler. Two engineers can argue for thirty minutes about whether a rewrite "sounds better." Show them both outputs on the same five cases and the disagreement either ends or sharpens into a test.

"A prompt change without a diff is a deploy without a test."

The pattern

Two panes, one input, identical model.

The diff locks three variables (input, model, tools) and changes one (the prompt). The left pane is version A; the right pane is version B. The shared input sits between them. The model and its sampling settings, such as temperature, sit above them. The user scrolls through a set of inputs and watches both outputs change in lockstep.
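As a rough illustration of the locked-versus-changing split, here is a minimal Python sketch. Everything in it is an assumption made for illustration: `LockedConfig`, `DiffRow`, `run_diff`, and the `generate` callable are hypothetical names, and `fake_generate` is a stand-in for a real model call, not any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class LockedConfig:
    """Settings held constant across both panes (hypothetical shape)."""
    model: str
    temperature: float
    seed: int
    tools: Tuple[str, ...] = ()


@dataclass(frozen=True)
class DiffRow:
    """One row of the diff view: a shared input and the two outputs."""
    input_text: str
    output_a: str
    output_b: str


# Assumed signature for a model call: (prompt, input_text, config) -> output
Generate = Callable[[str, str, LockedConfig], str]


def run_diff(
    prompt_a: str,
    prompt_b: str,
    inputs: List[str],
    config: LockedConfig,
    generate: Generate,
) -> List[DiffRow]:
    """Run both prompts over the same inputs with the same locked config."""
    rows = []
    for text in inputs:
        rows.append(
            DiffRow(
                input_text=text,
                output_a=generate(prompt_a, text, config),
                output_b=generate(prompt_b, text, config),
            )
        )
    return rows


def fake_generate(prompt: str, input_text: str, config: LockedConfig) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return f"[{config.model}] {prompt[:12]}... -> {input_text}"
```

Only the prompt differs between the two `generate` calls in each row, and the frozen `LockedConfig` makes it hard to change a setting by accident partway through a run.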

The best diffs highlight the differences within the outputs themselves. Not just "here are two paragraphs" but "these three sentences changed, these phrases are new, this tone shifted." The visual delta is what makes the review fast.
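One way to compute that visual delta is a word-level comparison using Python's standard `difflib`. This is a sketch, not a rendering engine: `word_diff` and `render` are hypothetical helper names, whitespace tokenization is a simplifying assumption, and the `[-…-]` / `{+…+}` markers stand in for whatever highlighting the UI uses.

```python
import difflib
from typing import List, Tuple


def word_diff(old: str, new: str) -> List[Tuple[str, str]]:
    """Return (kind, text) spans, where kind is 'same', 'removed', or 'added'."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            spans.append(("same", " ".join(a[i1:i2])))
            continue
        # 'replace', 'delete', and 'insert' all reduce to removed and/or added words
        if i1 < i2:
            spans.append(("removed", " ".join(a[i1:i2])))
        if j1 < j2:
            spans.append(("added", " ".join(b[j1:j2])))
    return spans


def render(spans: List[Tuple[str, str]]) -> str:
    """Render spans as plain text with inline markers for changed words."""
    parts = []
    for kind, text in spans:
        if kind == "removed":
            parts.append("[-" + text + "-]")
        elif kind == "added":
            parts.append("{+" + text + "+}")
        else:
            parts.append(text)
    return " ".join(parts)


if __name__ == "__main__":
    old = "Risks: retrieval regression under load."
    new = "Risk: retrieval regression under peak load."
    print(render(word_diff(old, new)))
```

Comparing words rather than characters keeps a one-word change readable as a one-word change.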

Two prompts, five inputs

Flip between inputs. Watch the two outputs move together.

Shared input · "Summarize yesterday's standup for the team. Keep it short."

model · mid · temperature 0.2 · seed 42

v3 · current

Prompt

You are an assistant. Summarize the provided standup notes. Keep the summary short and professional. Use bullet points. Include any risks.

Output

• Frontend: shipped the composer redo, added tests.
• Backend: P99 still elevated, investigating.
• Risks: retrieval regression under load.

v4 · proposed

Prompt

You are a standup summarizer. Write two short sentences, then a one-line risk. No bullets. No headings. Name the team lead when a team is mentioned.

Output

Frontend (M) shipped the composer redo with tests. Backend (E) is investigating an elevated P99 that showed up overnight.
Risk: retrieval regression under peak load.

+ shorter · + named leads · − no bullets

Promote writes a new named version. Nothing ships from the diff view.

The why

Prompts drift without diffs.

In teams without prompt diffs, the working prompt becomes a living document that nobody fully understands. Someone adds a rule; someone else reorders the examples; a third person rewrites the system message; the prompt that ships on Tuesday bears no clean lineage to the prompt that shipped on Monday.

A diff forces each change to stand on its own. You can see which edit caused the regression that users are reporting. You can undo. You can explain. This is the difference between engineering and alchemy.

Three moves

The diff I'd ship.

A case set, not one input. A diff on one input is a vibe check. A diff on ten carefully chosen inputs is a review. Teams should curate their case set the way they curate tests.
Inline highlighting of the delta. Diff the outputs at the word level, not the character level. A rewrite that changes one word should be visually distinct from a rewrite that changes the whole paragraph.
Commit from the diff. The diff should be the merge tool. If a user has to copy-paste the winning prompt into a separate file, you've introduced a place for the real change to diverge from the reviewed change.
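The third move, committing from the diff, can be sketched as a promote step that refuses to write anything other than the text that was reviewed. This is an illustrative assumption, not a prescribed design: `PromptRegistry`, `promote`, and `fingerprint` are hypothetical names, and the in-memory dictionary stands in for whatever version store a team actually uses.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


def fingerprint(prompt: str) -> str:
    """Content hash of a prompt, recorded when the diff is reviewed."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


@dataclass
class PromptRegistry:
    """Hypothetical in-memory store of named prompt versions."""
    versions: Dict[str, str] = field(default_factory=dict)

    def promote(self, name: str, prompt: str, reviewed_fingerprint: str) -> None:
        """Write a new named version only if it matches the reviewed text."""
        if name in self.versions:
            raise ValueError(f"version {name!r} already exists")
        if fingerprint(prompt) != reviewed_fingerprint:
            raise ValueError("prompt differs from the one that was reviewed")
        self.versions[name] = prompt


if __name__ == "__main__":
    reviewed = "You are a standup summarizer. Write two short sentences."
    registry = PromptRegistry()
    registry.promote("v4", reviewed, fingerprint(reviewed))
    print(sorted(registry.versions))
```

The hash check is one way to close the copy-paste gap: even a stray trailing space makes the promoted text differ from the reviewed text, and the write is rejected.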

The trap

The confident single-input diff.

The fastest way to ship a regression is to diff one input, see the output improve on that input, and merge. The change may have made that one case better and twelve others worse. The diff feels like science, but the eval that would have caught the regression was the case set you didn't run.

Pair prompt diff with the eval dock. A change that passes diff and passes the eval is a change you can trust.
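A minimal sketch of the case-set check that guards against this trap is below. The names `compare_on_cases` and `safe_to_merge` are hypothetical, `score` is an assumed stand-in for whatever metric the eval provides, and the merge rule (block on any regression) is one possible policy rather than a recommendation from the pattern itself.

```python
from typing import Callable, Dict, List


def compare_on_cases(
    cases: List[str],
    run_a: Callable[[str], str],
    run_b: Callable[[str], str],
    score: Callable[[str, str], float],
) -> Dict[str, List[str]]:
    """Bucket every case by whether version B scored higher, lower, or the same."""
    buckets: Dict[str, List[str]] = {"improved": [], "regressed": [], "unchanged": []}
    for case in cases:
        before = score(case, run_a(case))
        after = score(case, run_b(case))
        if after > before:
            buckets["improved"].append(case)
        elif after < before:
            buckets["regressed"].append(case)
        else:
            buckets["unchanged"].append(case)
    return buckets


def safe_to_merge(buckets: Dict[str, List[str]]) -> bool:
    """Assumed policy: any regressed case blocks the merge."""
    return not buckets["regressed"]
```

A single improved case cannot outvote the rest of the set under this rule, which is the failure the single-input diff invites.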

Failure modes

What this pattern gets wrong when it gets wrong.

Unverified claim: A figure or fact presented without provenance, in a place where the reader will treat it as cited.
Confidence theater: Language or typography that performs certainty beyond what the model actually has.

Seen in the wild

Three shipping variants worth copying.

A word-level red/green diff for the prompts themselves
A linked output diff below with matching highlights
A 'promote left' button that writes the winning prompt to a named version