You know how it is: you build an LLM feature, test it with three carefully crafted prompts, and it seems solid. Two days later a user reports an answer that is completely off the mark. Or worse: the feature gradually deteriorates without you having changed a single line of code.

Welcome to the world of probabilistic systems. The problem isn’t your code, but your reliance on vibe-based testing.

“It worked once” with an LLM is roughly equivalent to “a unit test passed once locally on a colleague’s machine”: nice, but worthless without reproducibility, coverage, and a baseline.

This post is a plea for eval-first. Eval tests are to LLM apps what unit tests are to deterministic C# code—but with a few crucial differences in their DNA.

Why LLM Testing Is Tricky

1. Non-determinism as a feature

LLMs are stochastic. Even with temperature: 0, nuances in tokenization or infrastructure updates can change the answer. A single test run is merely a random sample, not proof of correctness.
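
To make that concrete, here is a minimal C# sketch that treats a single call as one sample and simply counts distinct answers over repeated runs. The `callModel` delegate is a hypothetical stand-in for however you invoke your model; it is not part of any specific SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class Sampling
{
    // Run the same prompt several times and count the distinct answers.
    // `callModel` is a hypothetical stand-in for your actual chat-client call.
    public static async Task<IReadOnlyDictionary<string, int>> SampleOutputsAsync(
        Func<string, Task<string>> callModel, string prompt, int runs = 10)
    {
        var counts = new Dictionary<string, int>();
        for (var i = 0; i < runs; i++)
        {
            var answer = await callModel(prompt);
            counts[answer] = counts.TryGetValue(answer, out var n) ? n + 1 : 1;
        }
        // Seeing more than one distinct answer, even at temperature 0, is
        // exactly why a single green run proves very little.
        return counts;
    }
}
```

If ten runs of the same prompt produce more than one distinct answer, that alone is a strong argument for scoring over exact-match assertions.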

2. Quality is a vector, not a Boolean

In C#, a method is usually either right or wrong: it returns the expected value, throws an exception, or produces an incorrect result. With LLMs, the scale is broader (a scoring sketch follows this list):

  • Relevance: Does the answer hit the mark?
  • Faithfulness: Is the model hallucinating, or is it grounded in the data you provided (e.g., via RAG)?
  • Tone: Is the answer professional, or does it drift?
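
One way to make this multi-dimensional view tangible is a small score record. The dimensions mirror the list above; the 0.7 threshold is purely illustrative, not a recommended value.

```csharp
// Sketch: quality as a vector of scores (0.0–1.0 each), not a single bool.
public sealed record EvalScore(double Relevance, double Faithfulness, double Tone)
{
    // A hard fail on any single dimension vetoes an otherwise good answer.
    public bool Passes(double minimum = 0.7) =>
        Relevance >= minimum && Faithfulness >= minimum && Tone >= minimum;
}
```

In practice you will often give Faithfulness a stricter threshold than Tone, because a hallucination hurts more than a slightly stiff answer.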

3. The “silent regression”

Model updates at the provider (e.g., a minor update to GPT-4o), changes in embeddings in your vector database, or system prompt tweaks can ruin performance—while your classic unit tests remain happily green.
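
The practical countermeasure is to record the full configuration next to every eval run, so a score drop can be traced to a model update, a prompt tweak, or an embedding change. A minimal sketch, with purely illustrative property names:

```csharp
using System;

// Sketch: metadata stored alongside each eval run so regressions can be
// attributed. Property names are illustrative, not a fixed schema.
public sealed record EvalRunMetadata(
    string ModelId,          // provider model/version string actually used
    string PromptHash,       // hash of the system prompt actually sent
    string EmbeddingModelId, // embedding model behind your retrieval step
    DateTimeOffset RunAt);
```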

Evals vs. Unit Tests: A Comparison

| Feature | Unit Tests (Deterministic) | Evals (Probabilistic) |
| --- | --- | --- |
| Input | Static parameters | Datasets (golden sets) |
| Output | Assert.Equal(expected, actual) | Scoring (0.0 to 1.0) via rubrics |
| Error pattern | Crash or logic error | Drift, hallucination, bias |
| Runtime | Milliseconds | Seconds to minutes (API calls) |
| Gate logic | 100% pass required | Average scores & regression thresholds |
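
The last row is worth spelling out. Here is a minimal sketch of such a gate, assuming one aggregate score per test case; the threshold values are illustrative.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class EvalGate
{
    // Instead of requiring 100% pass, compare the average score of this run
    // against a stored baseline and fail only on a meaningful regression.
    public static bool Passes(
        IReadOnlyList<double> scores,    // one aggregate score per test case, 0.0–1.0
        double baselineAverage,          // average from the last accepted run
        double minimumAverage = 0.75,    // absolute floor (illustrative)
        double allowedRegression = 0.02) // tolerated drop vs. baseline (illustrative)
    {
        var average = scores.Average();
        return average >= minimumAverage
            && average >= baselineAverage - allowedRegression;
    }
}
```

The baseline average can come from the last run you explicitly accepted, for example a small JSON file committed next to the golden set.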

A Minimal Setup

Treat prompts, retrieval parameters, and model routing like production code. Here’s a small checklist you can use for every LLM-related pull request:

✅ LLM Feature PR Checklist

  • Dataset (Golden Set): At least 20–50 real-world test cases (inputs + desired outcomes), versioned as JSONL in the repo (see the loader sketch after this checklist).
  • Critical Failure Modes: Are edge cases defined (e.g., “What happens if no data is found?”)?
  • Rubric: Are there clear criteria for a “fail” (e.g., mentions of competing products, data leaks)?
  • CI Gate: Are evaluations automated? Is a report generated that shows whether the change improved or worsened performance?
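
To make the first checklist item concrete, here is one possible golden-set layout and a small loader. The field names (input, expected, tags) and the schema as a whole are assumptions for illustration, not a standard format.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// One JSONL line per test case, e.g.:
// {"input":"How do I reset my password?","expected":"Points to the self-service reset flow","tags":["account"]}
public sealed record GoldenCase(string Input, string Expected, string[] Tags);

public static class GoldenSet
{
    public static IEnumerable<GoldenCase> Load(string path)
    {
        var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
        foreach (var line in File.ReadLines(path))
        {
            if (string.IsNullOrWhiteSpace(line)) continue; // skip blank lines
            yield return JsonSerializer.Deserialize<GoldenCase>(line, options)!;
        }
    }
}
```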

Conclusion

The switch to eval-first is the moment you stop hoping for “AI magic” and start doing software engineering. In a follow-up post, I’ll show a minimal eval pipeline using Microsoft.Extensions.AI.