§ Guides
By AI Blog Editor
Apr 19, 2026 · 1 min read
The eval harness nobody regrets building
Stop shipping changes on vibes. A small seed set, an honest rubric, a fast grader, and a CI gate. How to build the measurement discipline that keeps you from lying to yourself.
Every team building with LLMs hits the same wall. A change feels better. Another change feels worse. The prompt grows, the model version shifts, the tools multiply, and nobody can say whether the whole thing is improving or just drifting. The fix is an eval harness, and the teams who build one almost universally regret not building it sooner.
Evals are not benchmarks. You don't care how Claude scores on MMLU. You care how your pipeline scores on your inputs with your success criteria. Start there.
Step 1: a tiny seed set
Collect 20–50 real inputs from your product. Not synthetic ones, not ones you wrote by imagining what a user might type. Real ones. Copy them from logs, from support tickets, from internal traffic. Label the ideal output for each.
This takes an afternoon. It's the single highest-ROI afternoon you'll spend. With 30 labelled examples you can distinguish "this change helped" from "this change felt good."
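Concretely, a seed set can be as simple as a JSONL file with one labelled example per line. The field names below (`id`, `input`, `ideal`) are one possible convention, not a standard — a minimal sketch:

```python
import json

# Hypothetical seed-set format: one JSON object per line, copied from real
# logs or tickets and labelled by hand. Contents here are invented examples.
SEED = """\
{"id": "ticket-4012", "input": "Cancel my subscription but keep my data", "ideal": "Confirms cancellation, explains data retention, no upsell."}
{"id": "log-88231", "input": "Why was my card charged twice?", "ideal": "Identifies the duplicate charge and offers a refund path."}
"""

def load_seed_set(text):
    """Parse a JSONL seed set into a list of labelled examples."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

examples = load_seed_set(SEED)
print(len(examples), examples[0]["id"])
```

A flat file in the repo is deliberate: the seed set should version alongside the prompt it measures.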
Step 2: a rubric that reflects your product
A rubric is a list of criteria and weights. "Correctness" always matters; the rest depends on the product. A legal summariser cares a lot about groundedness; a chatbot cares about tone; a code agent cares about whether the diff compiles.
Weights are a product decision, not a technical one. Two sensible engineers can weight correctness over conciseness or the other way around and both be right — their products are different. The example below shows how the winning candidate flips when you change what you care about.
Rubric weights
- Correctness: did the answer actually solve the task?
- Groundedness: is every factual claim supported by sources?
- Conciseness: no padding, no preamble, nothing extra.
- Format: output matches the required shape exactly.
- Safety: no unsafe output, no PII leaks, refusals when apt.
Candidate scores
- Baseline (cheap model): 3.64 / 5
- Tuned prompt: 4.10 / 5
- Bigger model: 4.26 / 5
The leader flips depending on the weights. That's not a bug — it's the point. Your rubric is a product decision.
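The flip is nothing deeper than weighted-average arithmetic. A minimal sketch, with invented per-criterion scores (none of these numbers are measurements):

```python
# Per-criterion scores (1-5) for three candidates. Purely illustrative.
SCORES = {
    "baseline":     {"correctness": 3.5, "groundedness": 3.8, "conciseness": 4.2, "format": 3.4, "safety": 3.6},
    "tuned prompt": {"correctness": 4.0, "groundedness": 4.1, "conciseness": 4.5, "format": 4.0, "safety": 3.9},
    "bigger model": {"correctness": 4.6, "groundedness": 4.4, "conciseness": 3.5, "format": 4.3, "safety": 4.2},
}

def weighted_score(scores, weights):
    """Weighted average of per-criterion scores."""
    return sum(scores[c] * w for c, w in weights.items()) / sum(weights.values())

def leader(weights):
    """Which candidate wins under this rubric?"""
    return max(SCORES, key=lambda name: weighted_score(SCORES[name], weights))

# A correctness-heavy rubric and a conciseness-heavy one pick different winners.
print(leader({"correctness": 5, "groundedness": 2, "conciseness": 1, "format": 1, "safety": 1}))
print(leader({"correctness": 1, "groundedness": 1, "conciseness": 5, "format": 1, "safety": 1}))
```

Same three candidates, same scores; only the weights moved.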
Step 3: a grader
You need two graders, and they check each other. One is a fast automated grader — usually another LLM with a narrow rubric prompt ("score 1–5 on correctness, with one sentence of reasoning"). The other is a slow human spot-check on 10–20% of outputs, looking for failure modes the automated grader misses.
The two are not redundant. The automated grader gets you a fast signal on every change. The human spot-check catches the failure modes your grader is blind to — and those are often the reputationally expensive ones.
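The automated half can be small. A sketch of a narrow-rubric grader — `call_llm` is a hypothetical text-in/text-out client you'd wire to your own provider, and the prompt and score-parsing convention are assumptions, not a standard:

```python
import re

# Narrow rubric: one criterion, one sentence of reasoning, one parseable score.
GRADER_PROMPT = """Score the ANSWER 1-5 on correctness against the IDEAL.
Reply with one sentence of reasoning, then "SCORE: <n>" on its own line.

TASK: {task}
IDEAL: {ideal}
ANSWER: {answer}"""

def grade(task, ideal, answer, call_llm):
    """Run the automated grader. `call_llm` is any prompt -> text callable."""
    reply = call_llm(GRADER_PROMPT.format(task=task, ideal=ideal, answer=answer))
    match = re.search(r"SCORE:\s*([1-5])", reply)
    if match is None:
        raise ValueError(f"ungradable grader reply: {reply!r}")
    return int(match.group(1))

# With a canned reply, the parsing path looks like this:
fake = lambda prompt: "The answer matches the ideal on every claim.\nSCORE: 4"
print(grade("task", "ideal", "answer", fake))
```

Raising on an unparseable reply is intentional: a silent default score is exactly the kind of blind spot the human spot-check exists to catch.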
Step 4: wire it into CI
Evals that aren't in CI rot. Put the seed set in a test runner, run the pipeline against it on every PR, diff the scores against main. A 3-point drop on correctness should block merge the same way a failing unit test does.
Start by blocking on a single scalar score; add per-criterion blocks once the team trusts the signal. It's better to ship a too-strict eval gate and loosen it than to start with warnings and watch them get ignored.
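The gate itself is a few lines in any test runner. A sketch, assuming you can load per-criterion scores for main and for the PR branch (the scores and threshold below are illustrative):

```python
# Illustrative scores; in practice these come from running the eval harness
# on main and on the PR branch respectively.
MAIN_SCORES = {"correctness": 4.1, "overall": 4.0}
PR_SCORES   = {"correctness": 3.7, "overall": 4.05}

MAX_DROP = 0.3  # block merge if any tracked criterion drops by more than this

def regressions(main, pr, max_drop):
    """Criteria whose score dropped more than max_drop versus main."""
    drops = {c: round(main[c] - pr[c], 2) for c in main}
    return {c: d for c, d in drops.items() if d > max_drop}

# In a pytest-style gate this would be `assert regressions(...) == {}`.
print(regressions(MAIN_SCORES, PR_SCORES, MAX_DROP))
```

A correctness drop of 0.4 trips the gate here even though the overall score nudged up — which is the whole argument for per-criterion blocks.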
Four traps
Grading your training set. If you tune prompts by staring at the same 20 examples, those 20 examples are now useless for measurement. Hold out a separate evaluation set the prompt author doesn't see.
Treating the grader as ground truth. Your LLM grader has opinions too. Calibrate it — sample 30 outputs, grade them by hand, compare to the LLM grader, and adjust the rubric prompt until they agree.
Optimising for score, losing the plot. A model can game a rubric the same way a student games a grading scheme. Watch the shape of winning outputs over time. If they all start sounding weirdly similar, your rubric has a blind spot.
Growing the seed set too slowly. Every real failure in production is an eval you should have had. Build the habit of copying a bad output into the seed set the day you see it.
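The calibration step from the second trap reduces to an agreement check between your hand grades and the LLM grader's grades. A sketch with invented grades:

```python
# Hand grades vs. LLM-grader grades on the same outputs (invented numbers;
# in practice you'd sample ~30 outputs and grade them yourself).
human = [4, 5, 3, 4, 2, 5, 4, 3]
llm   = [4, 4, 3, 5, 2, 5, 4, 4]

# Exact agreement and within-one-point agreement.
exact = sum(h == m for h, m in zip(human, llm)) / len(human)
close = sum(abs(h - m) <= 1 for h, m in zip(human, llm)) / len(human)
print(f"exact {exact:.1%}, within-one {close:.1%}")
```

If within-one agreement is low, adjust the rubric prompt and re-run the check before trusting the grader in CI.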
One line to remember
An eval you can run in sixty seconds before every merge is worth ten you could run quarterly. Start small, measure the things you care about, and let the harness grow with the product.
* * *
Thanks for reading. If a line here was useful — or plainly wrong — the comments are below and the newsletter has your back.