AI & Tooling

Building eval harnesses you'll actually trust

An eval suite you don't trust is worse than none — it gives you false confidence. Here's how to build one you'll lean on.

Jordan Lee
Contributor · learn.curry.io · May 24, 2026
11 min

Most teams know they should have evals. Fewer have ones they actually consult before shipping. The gap is trust — a suite full of vague checks and flaky scores gets ignored within a week. A small suite you believe gets used every day. Trust is the real design goal.

Start from real failures, not imagined ones

Don't sit down to "write evals" in the abstract. Pull the last twenty things your system got wrong — from logs, from user reports, from your own testing — and turn each into a case. Your eval set should be a museum of every way you've been burned. That's what makes a green run mean something.
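
One way to make this concrete is to give every past failure the same small record shape. The sketch below is a minimal illustration, not a prescribed schema: the `EvalCase` fields, the `cases_from_failures` helper, and the dict keys (`prompt`, `expected`, `source`, `note`) are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One past failure, captured as a repeatable case (hypothetical schema)."""
    case_id: str
    prompt: str     # the input that triggered the failure
    expected: str   # what a correct answer looks like
    source: str     # where it was found, e.g. "logs", "user_report", "testing"
    note: str = ""  # what went wrong the first time


def cases_from_failures(failures):
    """Turn raw failure records (dicts) into eval cases.

    Records whose prompt repeats an earlier one (ignoring case and
    surrounding whitespace) are skipped, so each failure appears once.
    """
    seen = set()
    cases = []
    for i, failure in enumerate(failures, start=1):
        key = failure["prompt"].strip().lower()
        if key in seen:
            continue
        seen.add(key)
        cases.append(
            EvalCase(
                case_id=f"fail-{i:03d}",
                prompt=failure["prompt"],
                expected=failure["expected"],
                source=failure["source"],
                note=failure.get("note", ""),
            )
        )
    return cases
```

Keeping `source` and `note` on each case is a deliberate choice here: when a case fails months later, you can see which real incident it came from.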

Decide what "good" means before you score

For each case, write down what a correct answer looks like and how you'll judge it: exact match, a rubric, or a model grading against criteria you wrote. The act of defining this surfaces disagreements on your team about what you're even building, which is worth the afternoon on its own.
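
As a rough sketch of what "decide how you'll judge it" can look like in code, here are two of those options written as small grader factories. The names `Verdict`, `exact_match`, and `rubric` are assumptions made for this example, and the rubric is reduced to named yes/no checks. A model-based grader is left out because it depends on whichever model client you use.

```python
from typing import Callable, Dict, NamedTuple


class Verdict(NamedTuple):
    passed: bool
    reason: str  # human-readable explanation of the result


def exact_match(expected: str) -> Callable[[str], Verdict]:
    """Grader that passes only if the output equals `expected` (whitespace-trimmed)."""
    def grade(output: str) -> Verdict:
        ok = output.strip() == expected.strip()
        reason = "matches expected" if ok else f"expected {expected!r}, got {output!r}"
        return Verdict(ok, reason)
    return grade


def rubric(criteria: Dict[str, Callable[[str], bool]]) -> Callable[[str], Verdict]:
    """Grader that passes only if every named criterion returns True."""
    def grade(output: str) -> Verdict:
        failed = [name for name, check in criteria.items() if not check(output)]
        if failed:
            return Verdict(False, "failed criteria: " + ", ".join(failed))
        return Verdict(True, "all criteria met")
    return grade


# Usage: each criterion has a name, so a failure says which one was missed.
grade_reply = rubric({
    "mentions refund": lambda out: "refund" in out.lower(),
    "under 50 words": lambda out: len(out.split()) <= 50,
})
print(grade_reply("We ship worldwide."))
```

Having every grader return a reason alongside the pass/fail flag costs little, and it means the judgment you wrote down travels with the result.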

The trust test
Ask one question of every eval: "If this fails, would I block the release?" If the answer is no, it's noise. Cut it or fix it until the answer is yes.

Make failures legible

A bare score of 0.82 tells you nothing actionable: it doesn't say which cases failed or why. When a case fails, your harness should show the input, the expected behavior, what you got, and why it was marked wrong, all on one screen. Debuggability is what turns an eval run from a chore into the first place you look when something regresses.
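
A minimal sketch of such a report, assuming each failure is available as five plain strings. The function name `format_failure` and the layout are illustrative choices, not a standard format.

```python
def format_failure(case_id, prompt, expected, actual, reason):
    """Render one failed case as a short, aligned, plain-text block."""
    lines = [
        f"FAIL {case_id}",
        f"  input:    {prompt}",
        f"  expected: {expected}",
        f"  got:      {actual}",
        f"  why:      {reason}",
    ]
    return "\n".join(lines)


print(format_failure(
    case_id="fail-003",
    prompt="How do I cancel my order?",
    expected="Link to the cancellation form",
    actual="Call our support line",
    reason="failed criteria: links to form",
))
```

Five lines per failure is short enough that a dozen regressions still fit in one terminal window, which is roughly the bar the "one screen" rule sets.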

Keep it small, keep it honest

A focused suite of fifty cases you trust beats two thousand you don't. Add a case every time something breaks; retire cases that no longer teach you anything. And guard against the quiet failure mode: tuning your system to pass the evals rather than to be good. The suite is a proxy. Keep checking that it still points at reality.
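
One possible way to check that the proxy still points at reality is to compare the pass rate on the cases you've been tuning against with the pass rate on fresh cases the system has never been tuned on. This is an assumption about how you might do it, not something the approach above requires, and the names `pass_rate` and `overfit_gap` are hypothetical.

```python
def pass_rate(results):
    """Fraction of True values in a list of pass/fail booleans."""
    if not results:
        raise ValueError("no results to score")
    return sum(results) / len(results)


def overfit_gap(tuned_results, fresh_results):
    """Pass rate on tuned-against cases minus pass rate on fresh cases.

    A large positive gap suggests the system is learning the suite
    rather than the task. What counts as "large" is a judgment call.
    """
    return pass_rate(tuned_results) - pass_rate(fresh_results)


# Usage: perfect on the familiar suite, half right on new cases.
gap = overfit_gap([True, True, True, True], [True, False, True, False])
print(f"gap: {gap:.2f}")
```

The fresh cases can come from the same place as the originals: the most recent failures in your logs that haven't yet been added to the suite.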

Get this right and evals stop being the thing you feel guilty about skipping. They become the dashboard you can't imagine shipping without.
