You know how it is: you build an LLM feature, test it with three carefully crafted prompts, and it seems solid. Two days later a user reports an answer that is completely off the mark. Or worse: the feature gradually deteriorates without you having changed a single line of code.

Welcome to the world of probabilistic systems. The problem isn’t your code, but your reliance on vibe-based testing.

“It worked once” with an LLM is roughly equivalent to “a unit test passed once locally on a colleague’s machine”: nice, but worthless without reproducibility, coverage, and a baseline.

This post is a plea for eval-first. Eval tests are to LLM apps what unit tests are to deterministic C# code—but with a few crucial differences in their DNA.

Why LLM Testing Is Tricky

1. Non-determinism as a feature

LLMs are stochastic. Even with temperature: 0, nuances in tokenization or infrastructure updates can change the answer. A single test run is merely a random sample, not proof of correctness.
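
To make that concrete, here is a minimal C# sketch that treats a single call as one sample and simply counts distinct answers over repeated runs. The `callModel` delegate is a hypothetical stand-in for however you invoke your model; it is not part of any specific SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class Sampling
{
    // Run the same prompt several times and count the distinct answers.
    // `callModel` is a hypothetical stand-in for your actual chat-client call.
    public static async Task<IReadOnlyDictionary<string, int>> SampleOutputsAsync(
        Func<string, Task<string>> callModel, string prompt, int runs = 10)
    {
        var counts = new Dictionary<string, int>();
        for (var i = 0; i < runs; i++)
        {
            var answer = await callModel(prompt);
            counts[answer] = counts.TryGetValue(answer, out var n) ? n + 1 : 1;
        }
        // Seeing more than one distinct answer, even at temperature 0, is
        // exactly why a single green run proves very little.
        return counts;
    }
}
```

If ten runs of the same prompt produce more than one distinct answer, that alone is a strong argument for scoring over exact-match assertions.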

2. Quality is a vector, not a Boolean

In C#, a method is usually either right or wrong: it returns the expected value, throws an exception, or produces an incorrect result. With LLMs, the scale is broader (a scoring sketch follows this list):

  • Relevance: Does the answer hit the mark?
  • Faithfulness: Is the model hallucinating, or is it grounded in the data you provided (e.g., via RAG)?
  • Tone: Is the answer professional, or does it drift?
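
One way to make this multi-dimensional view tangible is a small score record. The dimensions mirror the list above; the 0.7 threshold is purely illustrative, not a recommended value.

```csharp
// Sketch: quality as a vector of scores (0.0–1.0 each), not a single bool.
public sealed record EvalScore(double Relevance, double Faithfulness, double Tone)
{
    // A hard fail on any single dimension vetoes an otherwise good answer.
    public bool Passes(double minimum = 0.7) =>
        Relevance >= minimum && Faithfulness >= minimum && Tone >= minimum;
}
```

In practice you will often give Faithfulness a stricter threshold than Tone, because a hallucination hurts more than a slightly stiff answer.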

3. The “silent regression”

Model updates at the provider (e.g., a minor update to GPT-4o), changes in embeddings in your vector database, or system prompt tweaks can ruin performance—while your classic unit tests remain happily green.
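
The practical countermeasure is to record the full configuration next to every eval run, so a score drop can be traced to a model update, a prompt tweak, or an embedding change. A minimal sketch, with purely illustrative property names:

```csharp
using System;

// Sketch: metadata stored alongside each eval run so regressions can be
// attributed. Property names are illustrative, not a fixed schema.
public sealed record EvalRunMetadata(
    string ModelId,          // provider model/version string actually used
    string PromptHash,       // hash of the system prompt actually sent
    string EmbeddingModelId, // embedding model behind your retrieval step
    DateTimeOffset RunAt);
```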

Evals vs. Unit Tests: A Comparison

| Feature | Unit Tests (Deterministic) | Evals (Probabilistic) |
| --- | --- | --- |
| Input | Static parameters | Datasets (golden sets) |
| Output | Assert.Equal(expected, actual) | Scoring (0.0 to 1.0) via rubrics |
| Error pattern | Crash or logic error | Drift, hallucination, bias |
| Runtime | Milliseconds | Seconds to minutes (API calls) |
| Gate logic | 100% pass required | Average scores & regression thresholds |
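
The last row is worth spelling out. Here is a minimal sketch of such a gate, assuming one aggregate score per test case; the threshold values are illustrative.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class EvalGate
{
    // Instead of requiring 100% pass, compare the average score of this run
    // against a stored baseline and fail only on a meaningful regression.
    public static bool Passes(
        IReadOnlyList<double> scores,    // one aggregate score per test case, 0.0–1.0
        double baselineAverage,          // average from the last accepted run
        double minimumAverage = 0.75,    // absolute floor (illustrative)
        double allowedRegression = 0.02) // tolerated drop vs. baseline (illustrative)
    {
        var average = scores.Average();
        return average >= minimumAverage
            && average >= baselineAverage - allowedRegression;
    }
}
```

The baseline average can come from the last run you explicitly accepted, for example a small JSON file committed next to the golden set.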

A Minimal Setup

Treat prompts, retrieval parameters, and model routing like production code. Here’s a small checklist you can use for every LLM-related pull request:

✅ LLM Feature PR Checklist

  • Dataset (Golden Set): At least 20–50 real-world test cases (inputs + desired outcomes), versioned as JSONL in the repo (see the loader sketch after this checklist).
  • Critical Failure Modes: Are edge cases defined (e.g., “What happens if no data is found?”)?
  • Rubric: Are there clear criteria for a “fail” (e.g., mentions of competing products, data leaks)?
  • CI Gate: Are evaluations automated? Is a report generated that shows whether the change improved or worsened performance?
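
To make the first checklist item concrete, here is one possible golden-set layout and a small loader. The field names (input, expected, tags) and the schema as a whole are assumptions for illustration, not a standard format.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// One JSONL line per test case, e.g.:
// {"input":"How do I reset my password?","expected":"Points to the self-service reset flow","tags":["account"]}
public sealed record GoldenCase(string Input, string Expected, string[] Tags);

public static class GoldenSet
{
    public static IEnumerable<GoldenCase> Load(string path)
    {
        var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
        foreach (var line in File.ReadLines(path))
        {
            if (string.IsNullOrWhiteSpace(line)) continue; // skip blank lines
            yield return JsonSerializer.Deserialize<GoldenCase>(line, options)!;
        }
    }
}
```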

Conclusion

The switch to eval-first is the moment you stop hoping for “AI magic” and start doing software engineering. In a follow-up post, I’ll show a minimal eval pipeline using Microsoft.Extensions.AI.