The eval dock
A live pass/fail grid across a frozen test set.
Every team shipping AI eventually confronts the same question: is the thing I just changed better than the thing I had yesterday? Teams that answer it by feel ship regressions. Teams that answer it with a small, persistent grid of frozen cases and live pass/fail results ship improvements. The eval dock is that grid.
The dock is not a dashboard. Dashboards are for people who look at metrics. The dock is for people who are editing prompts. It sits next to the work.
"Vibes are a starting point, not a release gate."
A persistent grid, a live result.
The dock shows rows of test cases. Each row has a name, a model, a status (pass, fail, regressed, new), and a timestamp of last run. The header shows aggregate counts. Clicking a row expands the input, the expected, and the actual. Running a case is one keystroke. Running the whole set is one button.
The eval dock is a tool for people writing prompts. It should be present in the prompt editor, not buried under a "reports" menu. If you have to navigate away from your work to run the tests, you won't run the tests.
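The grid described above can be sketched as a small data model. This is an illustrative shape, not a prescribed schema; the names `DockRow`, `Status`, and `header_counts` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum

class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    REGRESSED = "regressed"  # passed on the previous run, fails now
    NEW = "new"              # never run before

@dataclass
class DockRow:
    """One row in the dock: a named case, the model it ran against,
    its current status, and when it last ran."""
    name: str
    model: str
    status: Status
    last_run: datetime

def header_counts(rows):
    """Aggregate counts shown in the dock header."""
    counts = {s: 0 for s in Status}
    for row in rows:
        counts[row.status] += 1
    return counts

rows = [
    DockRow("refund-policy", "gpt-4o", Status.PASS, datetime.now(timezone.utc)),
    DockRow("angry-customer", "gpt-4o", Status.REGRESSED, datetime.now(timezone.utc)),
]
print(header_counts(rows)[Status.PASS])  # 1
```

Keeping the row a flat record makes the "clicking a row expands the detail" interaction cheap: the expanded view just fetches input, expected, and actual by case name.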
Shipping AI without evals is gambling.
A prompt change that feels like an improvement is often a trade. The new prompt helps on the five cases you checked by eye and hurts on the twenty you didn't. Without an eval dock, the hurt cases are invisible until users find them. With an eval dock, the hurt cases show up as red rows the moment you run.
Teams that ship evals together with their prompts do not move slower. They move with more confidence, which lets them take bigger swings. A team without evals is a team tiptoeing around the prompt file.
The dock I'd ship.
- Cases are assertions, not vibes. Each case has a machine-checkable pass criterion — a regex, a substring, an LLM-graded criterion, a JSON shape. "It should sound good" is not a test.
- Regressions are loud. A case that passed yesterday and fails today should be visually distinct from a case that has always failed. Regression is the most interesting signal in the grid.
- One keystroke to run. The fastest feedback loop wins. If the developer has to click four things to re-run, they'll stop running. Bind it to a hotkey and make it loud.
A case set that doesn't evolve.
An eval dock is only as good as the cases in it. The worst failure of this pattern is a team that runs the same 12 cases for six months. The product moves; user failure modes change; the eval becomes a ritual that measures nothing. The dock should make it easy to add a failing user report as a new case.
The right rhythm is: user reports a bug, team adds the bug as a test case, team fixes the prompt until the case passes. The dock grows with the product.
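The "bug report becomes a test case" step can be as small as appending to the case file. A sketch under assumed conventions: a JSON file of cases and a substring check, with `add_case_from_report` as a hypothetical name.

```python
import json
from pathlib import Path

def add_case_from_report(case_file, name, user_input, must_contain):
    """Append a user-reported failure to the case set.
    The new case fails by construction: it captures the bug,
    and the team edits the prompt until the row turns green."""
    path = Path(case_file)
    cases = json.loads(path.read_text()) if path.exists() else []
    cases.append({
        "name": name,
        "input": user_input,
        "check": {"type": "substring", "value": must_contain},
    })
    path.write_text(json.dumps(cases, indent=2))
    return len(cases)
```

The important property is friction, not format: if turning a bug report into a case takes one call, the case set grows with the product instead of fossilizing.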
What this pattern gets wrong when it gets wrong.
- Unverified claim: a figure or fact presented without provenance, in a place where the reader will treat it as cited.
- Ghost citation: a source is shown but doesn't actually back the claim, or links to a page that doesn't contain the quoted text.
- Confidence theater: language or typography that performs certainty beyond what the model actually has.
Three shipping variants worth copying.
- A sticky dock at the bottom showing the last three eval runs at a glance
- A click-through on any red cell that opens the failing case with model output
- A quarantine button that marks a case non-blocking until a human reviews it