Eval-first: Why “It Worked Once” Is Not a Sign of Quality
You know how it is: you build an LLM feature, test it with three carefully crafted prompts, and it seems solid. Two days later a user reports a case where the output is completely off the mark. Or worse: the feature gradually deteriorates without you having changed a single line of code. Welcome to the world of probabilistic systems. The problem isn’t your code; it’s your reliance on vibe-based testing. “It worked once” with an LLM is roughly equivalent to “a unit test passed once locally on a colleague’s machine.” Nice, but worthless without reproducibility, coverage, and a baseline. ...
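To make the contrast concrete, here is a minimal sketch of what “eval-first” could look like in practice: a small, versioned set of test cases that is run repeatedly (to surface nondeterminism) and scored against a recorded baseline. Everything here is a hypothetical placeholder, not a prescribed setup: the `generate_answer` stub, the eval cases, the `must_contain` check, and the baseline pass rate.

```python
# Minimal eval-first sketch (all names are hypothetical placeholders):
# a fixed eval set, repeated runs, and a hard failure on regression.
import statistics


def generate_answer(prompt: str) -> str:
    # Placeholder for the actual LLM feature under test.
    raise NotImplementedError("wire up your LLM call here")


# A small, versioned eval set: input plus a simple programmatic check.
EVAL_CASES = [
    {"prompt": "Summarize: The cat sat on the mat.", "must_contain": "cat"},
    {"prompt": "Translate to French: Good morning.", "must_contain": "bonjour"},
]

BASELINE_PASS_RATE = 0.90  # last known-good score, tracked alongside the code
RUNS_PER_CASE = 3          # repeat each case to expose nondeterministic failures


def run_evals() -> float:
    """Return the mean pass rate across all cases and repetitions."""
    scores = []
    for case in EVAL_CASES:
        for _ in range(RUNS_PER_CASE):
            answer = generate_answer(case["prompt"]).lower()
            scores.append(1.0 if case["must_contain"] in answer else 0.0)
    return statistics.mean(scores)


if __name__ == "__main__":
    pass_rate = run_evals()
    print(f"pass rate: {pass_rate:.2%} (baseline: {BASELINE_PASS_RATE:.2%})")
    # Fail loudly on regressions instead of trusting "it worked once".
    if pass_rate < BASELINE_PASS_RATE:
        raise SystemExit("regression against baseline")
```

The specifics matter less than the habit: the eval set lives in version control, runs on every change, and a drop below the baseline blocks the change, exactly the reproducibility, coverage, and baseline that a one-off manual test can’t give you.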