
Dev & eval

The surfaces that make AI products debuggable.

Prompts change, models change, latencies wobble, tokens leak. The interfaces that let a team diff a prompt, watch a trace, and run a small eval in-product are the ones that ship confident releases instead of prayerful ones.

4 patterns
  1. Prompt diff
    Two prompts, side by side, same inputs, different outputs.
    11 min
  2. The eval dock
    A live pass/fail grid across a frozen test set.
    12 min
  3. Latency trace
    A waterfall of where the seconds went.
    10 min
  4. Token map
    A picture of which tokens did what, and why.
    11 min
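To make the second pattern concrete, here is a minimal sketch of an eval dock: run a frozen test set against a model function and render a pass/fail grid. Everything here is hypothetical illustration (`run_eval`, `render_grid`, the toy model, and the substring-match pass criterion are assumptions, not a prescribed implementation); in practice `toy_model` would be a real completion call and the pass check would be whatever your product's eval defines.

```python
def run_eval(cases, model_fn):
    """Run a frozen test set; return (case_id, passed) pairs."""
    results = []
    for case in cases:
        output = model_fn(case["input"])
        # Toy pass criterion: expected string appears in the output.
        passed = case["expect"] in output
        results.append((case["id"], passed))
    return results

def render_grid(results):
    """One line per case, check for pass, cross for fail, plus a summary row."""
    lines = [f"{'PASS' if ok else 'FAIL'}  {cid}" for cid, ok in results]
    passed = sum(ok for _, ok in results)
    lines.append(f"{passed}/{len(results)} passing")
    return "\n".join(lines)

# Hypothetical stand-in for a model call, just to exercise the grid.
def toy_model(prompt):
    return prompt.upper()

cases = [
    {"id": "greeting", "input": "hello", "expect": "HELLO"},
    {"id": "empty-safe", "input": "", "expect": "?"},
]

print(render_grid(run_eval(cases, toy_model)))
```

The point of the dock is the frozen test set: because `cases` never changes between runs, a flipped cell in the grid can only mean the prompt or model changed, which is what makes the surface debuggable.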