Eval is a product, not a script
Why most LLM evals quietly rot — and the small shifts that keep them honest as your system evolves.
Most teams write an evaluation script early, ship it once, and never touch it again. That is why their evals fail.
The rot cycle
When your LLM application is new, your eval script feels complete. You define your success metric, wire it up to your CI/CD pipeline, and move on. The metric sits there, dutifully returning scores.
But your application evolves:
- You add a new feature (RAG, tool-use, streaming)
- Your user base shifts to a new domain or language
- You discover a failure mode through production logs
- Your downstream task changes slightly
Your eval doesn't change. It still reports the same number. Teams see this and do one of two things:
- They trust the number and ship anyway, then get surprised in production
- They distrust the number and rebuild it, wasting weeks on infrastructure
Either way, your eval is now technical debt: a tax you pay on every iteration.
Why this happens
We treat evals like code. We write them once, we test them, we deploy them. But evals are more like UX than code. They're a measurement interface between your product and your success metric.
Just like UX rots when your users change, evals rot when:
- Your task semantics evolve
- Your model's failure modes shift
- Your data distribution drifts
- Your stakeholders' priorities realign
The standard response is to rebuild in a panic. The better response is to treat eval development like product development.
Practical shifts
1. Version your evals
Tag each eval with the context it was valid for:
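```python
# `eval.version` is illustrative here: a hypothetical registry decorator, not Python's builtin `eval`.
@eval.version("1.0", valid_for="single-turn QA on English news")
def accuracy_at_1(predictions, references):
    # Fraction of predictions that exactly match their reference.
    return sum(p == r for p, r in zip(predictions, references)) / len(predictions)
```

When you change the eval, increment the version and document why. This makes it obvious when a performance improvement is real vs. when your metric just shifted.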
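One way a versioning decorator like this could be backed is a small in-process registry. The sketch below is an assumption, not an existing library: `EvalRegistry`, `EvalInfo`, `evals`, and the `get` method are hypothetical names chosen for illustration. It records the version and validity context of each eval and refuses to re-register the same name and version, so a changed eval has to take a new version number.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class EvalInfo:
    """Metadata recorded for one version of one eval (hypothetical structure)."""
    name: str
    version: str
    valid_for: str
    fn: Callable[..., float]


class EvalRegistry:
    """Hypothetical registry that tracks the version and context of each eval."""

    def __init__(self) -> None:
        self._evals: Dict[Tuple[str, str], EvalInfo] = {}

    def version(self, version: str, valid_for: str):
        def decorator(fn):
            key = (fn.__name__, version)
            if key in self._evals:
                # Refuse silent redefinition: a changed eval needs a new version.
                raise ValueError(f"{fn.__name__} v{version} is already registered")
            self._evals[key] = EvalInfo(fn.__name__, version, valid_for, fn)
            return fn
        return decorator

    def get(self, name: str, version: str) -> EvalInfo:
        return self._evals[(name, version)]


evals = EvalRegistry()


@evals.version("1.0", valid_for="single-turn QA on English news")
def accuracy_at_1(predictions, references):
    # Fraction of predictions that exactly match their reference.
    return sum(p == r for p, r in zip(predictions, references)) / len(predictions)
```

Storing `valid_for` next to the function means a score can always be reported together with the context it was meant for, for example via `evals.get("accuracy_at_1", "1.0").valid_for`.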
2. Monitor eval coverage
Track which failure modes your eval actually catches. If you ship something and it immediately breaks in production in a way your eval didn't catch, you've found a blind spot.
Add that case to your eval. This is how metrics evolve.
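A minimal sketch of how blind spots could be surfaced, assuming each eval case and each production incident is tagged with a failure-mode label. The tag names and the `coverage_gaps` function are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def coverage_gaps(eval_case_tags: Iterable[Set[str]],
                  incident_tags: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (failure_mode, incident_count) for modes that no eval case exercises."""
    # Union of every failure mode that at least one eval case is tagged with.
    covered = set().union(*eval_case_tags)
    incidents = Counter(incident_tags)
    gaps = [(mode, n) for mode, n in incidents.items() if mode not in covered]
    # Most frequent uncovered failure modes first, ties broken by name.
    return sorted(gaps, key=lambda item: (-item[1], item[0]))


# Hypothetical tags: two eval cases and four production incidents.
eval_cases = [{"single_hop"}, {"single_hop", "date_math"}]
production_incidents = ["multi_hop", "date_math", "multi_hop", "non_english"]

print(coverage_gaps(eval_cases, production_incidents))
# [('multi_hop', 2), ('non_english', 1)]
```

Each entry in the output is a candidate for a new eval case, ordered by how often it shows up in production.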
3. Make evals cheap to iterate on
If spinning up a new eval variant takes 4 hours, you won't try variations. If it takes 4 minutes, you will.
Invest in:
- Fast evaluation loops (cached model outputs, parallel execution)
- Easy parameterization (template languages, config files)
- Quick feedback (streaming results, progress bars)
- Low friction integration (git-tracked eval definitions, easy rollback)
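As one illustration of the first item, the sketch below caches model outputs on disk and runs prompts in parallel, so rerunning an eval variant only pays for prompts it has not seen. It assumes the model is exposed as a plain `model_fn(prompt) -> str` callable; `cached_generate`, `run_eval`, and the cache layout are hypothetical choices, not a particular framework's API.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List


def cached_generate(model_fn: Callable[[str], str], model_id: str,
                    prompt: str, cache_dir: Path) -> str:
    """Return the cached output for (model_id, prompt), calling model_fn on a miss."""
    key = hashlib.sha256(f"{model_id}\n{prompt}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["output"]
    output = model_fn(prompt)
    path.write_text(json.dumps({"prompt": prompt, "output": output}), encoding="utf-8")
    return output


def run_eval(model_fn: Callable[[str], str], model_id: str, prompts: List[str],
             cache_dir: Path, max_workers: int = 8) -> List[str]:
    """Generate outputs for all prompts in parallel, reusing cached results."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so outputs line up with prompts.
        # Note: duplicate prompts within one run may each call the model once.
        return list(pool.map(
            lambda p: cached_generate(model_fn, model_id, p, cache_dir), prompts))
```

Keying the cache on both the model identifier and the prompt means a new scoring function can be tried against stored outputs without calling the model again, while a new model version naturally misses the cache.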
4. Rotate eval ownership
The person who wrote your eval gets attached to it. They unconsciously defend it against criticism, tune it to their biases, and resist changing it even when the data screams that it's wrong.
Rotate eval maintenance between team members. Fresh eyes catch what veterans defend.
5. Treat eval changes as feature work
When you update an eval, commit it to your feature branch. Run it against your baseline model and your new model. Include the delta in your PR description:
Eval change: Updated QA eval to include multi-hop reasoning cases
- Baseline: 67% → 61% (3 new hard cases added)
- This PR: 63% → 65% (net +2, regression on hop-1 only)
This makes it transparent that the metric shifted, not that your model regressed.
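A summary like this could be generated rather than typed by hand. The sketch below reuses the numbers from the example above and assumes each arrow reads as "score on the old eval → score on the new eval"; that reading, the `eval_change_report` function, and the extra model-delta line are assumptions made for illustration.

```python
from typing import Tuple

Scores = Tuple[float, float]  # (score on old eval, score on new eval)


def eval_change_report(title: str, baseline: Scores, candidate: Scores) -> str:
    """Format a PR-description snippet that separates metric shift from model change."""
    lines = [f"Eval change: {title}"]
    for label, (old, new) in (("Baseline", baseline), ("This PR", candidate)):
        # Same model, different eval version: this difference is the metric shifting.
        shift = (new - old) * 100
        lines.append(f"- {label}: {old:.0%} → {new:.0%} (metric shift {shift:+.0f} pts)")
    # Same eval version, different model: this difference is the model changing.
    model_delta = (candidate[1] - baseline[1]) * 100
    lines.append(f"- Model delta on new eval: {model_delta:+.0f} pts")
    return "\n".join(lines)


report = eval_change_report(
    "Updated QA eval to include multi-hop reasoning cases",
    baseline=(0.67, 0.61),
    candidate=(0.63, 0.65),
)
print(report)
```

Scoring both models under both eval versions is what lets a reviewer tell the two effects apart: the per-model lines show how much the metric moved, and the last line compares the models on equal footing.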
The meta-pattern
Evals are measurements of your system's quality at a point in time. That system changes. Your environment changes. Your definition of quality changes.
The teams that win aren't the ones with the perfect eval. They're the ones that treat evaluation as a continuous product, not a one-time script. They iterate on their metrics as fast as they iterate on their models.
Your eval isn't done. It's just shipped. There's a difference.