You know how it is: you build an LLM feature, test it with three sophisticated prompts, it seems solid—and two days later a user reports a case that is completely off the mark. Or even worse: the feature gradually deteriorates without you having changed a single line of code.
Welcome to the world of probabilistic systems. The problem isn’t your code, but your reliance on vibe-based testing.
“It worked once” with LLMs is roughly equivalent to “a unit test passed once locally on a colleague’s machine.” Nice—but worthless without reproducibility, coverage, and a baseline.
This post is a plea for eval-first. Eval tests are to LLM apps what unit tests are to deterministic C# code—but with a few crucial differences in their DNA.
Why LLM Testing Is Tricky
1. Non-determinism as a feature
LLMs are stochastic. Even with temperature: 0, nuances in tokenization or infrastructure updates can change the answer. A single test run is merely a random sample, not proof of correctness.
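The practical consequence: treat every check as a sample over several runs, not a single verdict. Here is a minimal C# sketch of that idea; the askModel delegate and the passes predicate are placeholders for your actual model call and for your cheapest assertion.

```csharp
using System;
using System.Threading.Tasks;

public static class SampledCheck
{
    // Runs the same prompt several times and reports the pass rate
    // instead of trusting a single green run.
    public static async Task<double> PassRateAsync(
        Func<string, Task<string>> askModel,  // placeholder: your real LLM call
        string prompt,
        Func<string, bool> passes,            // cheap check, e.g. "mentions the order ID"
        int runs = 5)
    {
        var passed = 0;
        for (var i = 0; i < runs; i++)
        {
            var answer = await askModel(prompt);
            if (passes(answer)) passed++;
        }
        return passed / (double)runs;
    }
}
```

A pass rate of 4/5 tells you far more than one lucky success, and it gives you a number you can track across model or prompt changes.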
2. Quality is a vector, not a Boolean
In C#, a method is usually either correct or visibly broken: it throws an exception, or it returns a wrong value that an assertion catches. With LLMs, quality spans several dimensions (a scoring sketch follows this list):
- Relevance: Does the answer hit the mark?
- Faithfulness: Is the model hallucinating, or is it grounded in the data you provided (e.g., via RAG)?
- Tone: Does the answer stay professional, or does it drift into the wrong register?
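One way to make that vector concrete is a small score record per test case, with aggregation happening only at the gate. This is a sketch under assumptions, not a fixed schema: the dimension names mirror the list above, and the weights and threshold are values you would tune for your use case.

```csharp
// Quality as a vector of scores (each 0.0 to 1.0), not a single Boolean.
public sealed record EvalScore(double Relevance, double Faithfulness, double Tone)
{
    // Illustrative weights: faithfulness violations usually hurt the most in RAG scenarios.
    public double Overall => 0.3 * Relevance + 0.5 * Faithfulness + 0.2 * Tone;

    public bool Passes(double threshold = 0.8) => Overall >= threshold;
}
```

Keeping the dimensions separate until the gate makes regressions easier to diagnose: a drop in Faithfulness usually points at retrieval or grounding, a drop in Tone at the system prompt.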
3. The “silent regression”
Model updates at the provider (e.g., a minor update to GPT-4o), changes to the embeddings in your vector database, or system prompt tweaks can silently degrade quality while your classic unit tests stay happily green.
Evals vs. Unit Tests: A Comparison
| Feature | Unit Tests (Deterministic) | Evals (Probabilistic) |
|---|---|---|
| Input | Static parameters | Datasets (golden sets) |
| Output | `Assert.Equal(expected, actual)` | Scoring (0.0 to 1.0) via rubrics |
| Error pattern | Crash or logic error | Drift, hallucination, bias |
| Runtime | Milliseconds | Seconds to minutes (API calls) |
| Gate logic | 100% pass required | Average scores & regression thresholds |
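To make the last two rows concrete: instead of Assert.Equal on an exact string, the test gates on aggregate scores across a batch. The sketch below uses xUnit; RunEvalBatchAsync, the dataset path, and the thresholds are illustrative placeholders, not a prescribed API.

```csharp
using System.Linq;
using System.Threading.Tasks;
using Xunit;

public class EvalGateTests
{
    // Placeholder: replace with whatever runs your golden set against the model
    // and returns one rubric score (0.0 to 1.0) per test case.
    private static Task<double[]> RunEvalBatchAsync(string datasetPath) =>
        Task.FromResult(new[] { 0.92, 0.88, 0.95 });

    [Fact]
    public async Task Summarizer_MeetsQualityBar()
    {
        double[] scores = await RunEvalBatchAsync("datasets/summarizer.jsonl");

        // Unlike Assert.Equal(expected, actual), the gate is a threshold on aggregates.
        Assert.True(scores.Average() >= 0.85,
            $"Average eval score dropped to {scores.Average():F2}");
        Assert.True(scores.Min() >= 0.50,
            "At least one case scored below the hard floor (possible hallucination).");
    }
}
```

Because each run costs real API calls, these tests typically live in a separate CI stage rather than in the fast unit test suite.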
A Minimal Setup
Treat prompts, retrieval parameters, and model routing like production code. Here’s a small checklist you can use for every LLM-related pull request:
✅ LLM Feature PR Checklist
- Dataset (Golden Set): 20–50 real-world test cases to start (inputs + desired outcomes), versioned as JSONL in the repo (see the loading sketch after this checklist).
- Critical Failure Modes: Are edge cases defined (e.g., “What happens if no data is found?”)?
- Rubric: Are there clear criteria for a “fail” (e.g., mentions of competing products, data leaks)?
- CI Gate: Are evaluations automated? Is a report generated that shows whether the change improved or worsened performance?
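To make the first and last items concrete, here is a minimal sketch of loading a versioned JSONL golden set and gating on regression against a stored baseline average. The file layout, field names, and the 0.02 tolerance are assumptions to adapt, not a fixed convention.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text.Json;

// One JSON object per line in the versioned golden set, e.g.:
// {"input": "Where is my order #4711?", "expectedOutcome": "Asks for the customer number", "tags": ["happy-path"]}
public sealed record GoldenCase(string Input, string ExpectedOutcome, string[] Tags);

public static class EvalGate
{
    public static GoldenCase[] LoadGoldenSet(string path) =>
        File.ReadLines(path)
            .Where(line => !string.IsNullOrWhiteSpace(line))
            .Select(line => JsonSerializer.Deserialize<GoldenCase>(line,
                new JsonSerializerOptions { PropertyNameCaseInsensitive = true })!)
            .ToArray();

    // CI gate: fail the build if the new average regresses by more than `tolerance`
    // against the baseline committed alongside the dataset.
    public static bool PassesGate(double baselineAverage, double[] newScores, double tolerance = 0.02)
    {
        var newAverage = newScores.Average();
        Console.WriteLine($"Baseline {baselineAverage:F2} -> current {newAverage:F2}");
        return newAverage >= baselineAverage - tolerance;
    }
}
```

The baseline average can live next to the dataset in the repo and only gets bumped deliberately, so every “improved or worsened” question in the PR has a number attached.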
Conclusion
The switch to eval-first is the moment you stop hoping for “AI magic” and start doing software engineering. In a follow-up post, I’ll show a minimal eval pipeline using Microsoft.Extensions.AI.