Knowlify

How to evaluate LLM outputs without going crazy.

You shipped a feature that calls an LLM. Now the question lands: how do you know it's working? The answer is duller than you'd hope, and more useful.

Published April 30, 2026

You shipped an LLM-powered feature. It works in the demo. Three weeks later your PM asks "is the new prompt actually better than the old one?" and you realize you have no idea how to answer.

This is the eval problem. It's the part of building with LLMs that nobody talks about, partly because the solutions look unglamorous, and partly because most teams genuinely don't have good ones.

Here's the working version.

Three levels of evaluation, in order

You need all three. Skipping levels is how teams ship features that look great in demos and break in production.

| Level | What it tests | How to do it | When you need it |
| --- | --- | --- | --- |
| 1. Vibe check | "Does this look right to me?" | You + the team eyeball 20 outputs | Day one. Always. |
| 2. Golden set | "Does the new version do better than the old on cases I care about?" | 50-100 hand-picked inputs with expected outputs. Diff the two versions. | Once you have a prompt people depend on. |
| 3. Production telemetry | "What's actually happening in the wild?" | Log every input + output. Sample 1%. Flag anomalies. | Once it's in front of users. |

You add a level only when the level below it isn't catching enough. Skipping is tempting because it's faster, and it's how regressions get to production.
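Level 3 can start very small. Here is a minimal sketch of "log every input + output, sample 1%", assuming a JSON-lines file as the log. The names `in_review_sample` and `log_call`, and the record fields, are made up for illustration and don't come from any particular library. Anomaly flagging is left out.

```python
import hashlib
import json
import time


def in_review_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically pick roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def log_call(request_id: str, prompt: str, output: str, log_path: str) -> dict:
    """Append one record per LLM call and mark the ones a human should read."""
    record = {
        "request_id": request_id,  # hypothetical field names
        "timestamp": time.time(),
        "prompt": prompt,
        "output": output,
        "review": in_review_sample(request_id),
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Hashing the request ID instead of calling a random number generator means the same request always lands in or out of the sample, which makes the review queue reproducible.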

What "golden set" actually means

This is the level where most teams stall. They imagine they need a thousand hand-labeled cases. They don't. They need fifty cases that genuinely represent the variety of inputs the system will see.

  • 10 cases that should "just work"
  • 10 cases that are slightly weird (typos, edge formatting, partial inputs)
  • 10 cases that have failed historically
  • 10 cases that are adversarial (a user trying to break the system)
  • 10 cases of genuinely hard real inputs

For each case, write down what a good output looks like. Doesn't have to be the exact output — just enough that you can tell a good response from a bad one. When you change the prompt, re-run all 50 and diff. You'll see immediately which categories of input the change helped and which it hurt.
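The re-run-and-diff step can be sketched in a few lines. This is one possible shape, not a standard tool: `GoldenCase`, `passes`, and `diff_versions` are hypothetical names, and the "good output" check here is a deliberately crude assumption (required phrases must appear). Swap in whatever check matches your notes on what good looks like.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    case_id: str
    category: str            # e.g. "just_works", "weird", "adversarial"
    input_text: str
    must_include: list[str]  # phrases a good output should contain


def passes(case: GoldenCase, output: str) -> bool:
    """Crude check: every required phrase appears in the output."""
    text = output.lower()
    return all(phrase.lower() in text for phrase in case.must_include)


def diff_versions(
    cases: list[GoldenCase],
    old_fn: Callable[[str], str],
    new_fn: Callable[[str], str],
) -> dict:
    """Group cases whose pass/fail status changed, by category."""
    report = defaultdict(lambda: {"fixed": [], "broken": []})
    for case in cases:
        old_ok = passes(case, old_fn(case.input_text))
        new_ok = passes(case, new_fn(case.input_text))
        if new_ok and not old_ok:
            report[case.category]["fixed"].append(case.case_id)
        elif old_ok and not new_ok:
            report[case.category]["broken"].append(case.case_id)
    return dict(report)  # categories with no change are omitted


# Stand-ins for "call the model with the old prompt / new prompt".
def old_version(text: str) -> str:
    return "Refunds within 30 days." if "Refund" in text else "Sorry?"


def new_version(text: str) -> str:
    return "You can get a refund within 30 days."


cases = [
    GoldenCase("c1", "just_works", "Refund policy?", ["30 days"]),
    GoldenCase("c2", "weird", "refnd polcy??", ["30 days"]),
]
print(diff_versions(cases, old_version, new_version))
# {'weird': {'fixed': ['c2'], 'broken': []}}
```

A string check like this only tells you which cases changed status. You still read the changed outputs yourself.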

The trap to avoid

Don't grade outputs with another LLM as your only eval. LLM-as-judge is fine as a signal — but it has the same blind spots as the model you're evaluating. At least one of your eval levels has to involve a human reading the actual output. Skip that and you'll ship regressions that no model would have caught.
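One way to keep the judge honest is to have a human grade a sample of the same outputs and compare. A minimal sketch, assuming you already have both verdicts per case; `Verdict` and `judge_blind_spots` are hypothetical names for illustration.

```python
from typing import NamedTuple


class Verdict(NamedTuple):
    case_id: str
    judge_ok: bool  # what the LLM judge said
    human_ok: bool  # what a person said after reading the output


def judge_blind_spots(verdicts: list[Verdict]) -> dict:
    """Summarize where the judge and the human reader disagree."""
    if not verdicts:
        raise ValueError("need at least one human-reviewed verdict")
    agree = sum(v.judge_ok == v.human_ok for v in verdicts)
    missed = [v.case_id for v in verdicts if v.judge_ok and not v.human_ok]
    return {
        "agreement": agree / len(verdicts),
        "missed_by_judge": missed,  # judge passed it, human failed it
    }
```

The `missed_by_judge` list is the interesting part: those are the outputs an LLM-only eval would have waved through.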

A common pattern behind major LLM regressions in 2025: the team had been relying on LLM-as-judge as its only post-deploy check.

What to measure

You don't need fancy metrics. For most LLM features, three categories cover almost everything:

Correctness. Did it give the right answer? (Binary for things with a right answer; rubric for things without.)

Tone. Did it sound how you wanted? (Friendly, terse, professional — pick one and grade against it.)

Hallucination rate. How often did it make things up? (Count claims that aren't supported by the input.)

Track these per category of input from your golden set. The number to watch isn't the average; it's the worst category. A feature that's 95% correct on average but 40% correct on edge cases is a feature that will fail spectacularly in front of your most important user.
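To make the average-versus-worst-category gap concrete, here is a small sketch. The function names and the `(category, is_correct)` input shape are assumptions for illustration, and the numbers are invented to reproduce the 95% / 40% split.

```python
from collections import defaultdict


def correctness_by_category(results):
    """results: iterable of (category, is_correct) pairs from a golden-set run."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {cat: correct[cat] / totals[cat] for cat in totals}


def worst_category(rates):
    """Return (category, rate) for the lowest-scoring category."""
    category = min(rates, key=rates.get)
    return category, rates[category]


# Invented data: 110 typical cases all correct, 10 edge cases with 4 correct.
results = (
    [("typical", True)] * 110
    + [("edge", True)] * 4
    + [("edge", False)] * 6
)
overall = sum(ok for _, ok in results) / len(results)
rates = correctness_by_category(results)
print(overall)                # 0.95
print(worst_category(rates))  # ('edge', 0.4)
```

The overall number looks healthy because the easy cases outnumber the hard ones. Reporting the minimum alongside it keeps the weak category visible.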

The cadence

Run the golden set every time you change the prompt. Take 20 minutes once a month to review production samples manually. Pull anomalies into the golden set. Repeat.

That's the loop. It's boring. It's also the entire difference between "we shipped an LLM feature" and "we have an LLM feature that works".