How to evaluate LLM outputs without going crazy.
You shipped a feature that calls an LLM. Now the question lands: how do you know it's working? The answer is duller than you'd hope, and more useful.
You shipped an LLM-powered feature. It works in the demo. Three weeks later your PM asks "is the new prompt actually better than the old one?" and you realize you have no idea how to answer.
This is the eval problem. It's the part of building with LLMs that nobody talks about, partly because the solutions look unglamorous, and partly because most teams genuinely don't have good ones.
Here's the working version.
Three levels of evaluation, in order
You need all three. Skipping levels is how teams ship features that look great in demos and break in production.
| Level | What it tests | How to do it | When you need it |
|---|---|---|---|
| 1. Vibe check | "Does this look right to me?" | You + the team eyeball 20 outputs | Day one. Always. |
| 2. Golden set | "Does the new version do better than the old on cases I care about?" | 50-100 hand-picked inputs with expected outputs. Diff the two versions. | Once you have a prompt people depend on. |
| 3. Production telemetry | "What's actually happening in the wild?" | Log every input + output. Sample 1%. Flag anomalies. | Once it's in front of users. |
What "golden set" actually means
This is the level where most teams stall. They imagine they need a thousand hand-labeled cases. They don't. They need fifty cases that genuinely represent the variety of inputs the system will see.
- 10 cases that should "just work"
- 10 cases that are slightly weird (typos, edge formatting, partial inputs)
- 10 cases that have failed historically
- 10 cases that are adversarial (a user trying to break the system)
- 10 cases of genuinely hard real inputs
For each case, write down what a good output looks like. Doesn't have to be the exact output — just enough that you can tell a good response from a bad one. When you change the prompt, re-run all 50 and diff. You'll see immediately which categories of input the change helped and which it hurt.
Don't grade outputs with another LLM as your only eval. LLM-as-judge is fine as a signal — but it has the same blind spots as the model you're evaluating. At least one of your eval levels has to involve a human reading the actual output. Skip that and you'll ship regressions that no model would have caught.
What to measure
You don't need fancy metrics. For most LLM features, three categories cover almost everything:
Correctness. Did it give the right answer? (Binary for things with a right answer; rubric for things without.)
Tone. Did it sound how you wanted? (Friendly, terse, professional — pick one and grade against it.)
Hallucination rate. How often did it make things up? (Count claims that aren't supported by the input.)
Track these per category of input from your golden set. The number to watch isn't the average — it's the worst category. A feature that's 95% correct on average but 40% correct on edge cases is a feature that will spectacularly fail in front of your most important user.
The cadence
Run the golden set every time you change the prompt. Take 20 minutes once a month to review production samples manually. Pull anomalies into the golden set. Repeat.
That's the loop. It's boring. It's also the entire difference between "we shipped an LLM feature" and "we have an LLM feature that works".