At some point, you will end up doing some form of prompt engineering. And often, it starts with vibes. You change a word or a phrase, add a little here, remove a little there, test it once, and it seems better. So you ship it.

Then the next day, users complain that the quality of the answers got worse.

The Problem: Prompt Regressions

Prompts are fragile. A minor tweak, a new example, or even an upgrade to a newer model version can cause regressions: the model suddenly fails at tasks it used to handle well.

Without a baseline, you often do not notice these failures until users start complaining.

The Solution: The “Golden Dataset”

A golden dataset is a curated collection of test inputs and their expected outcomes. It becomes your baseline for evaluation. Before you commit a prompt change, you run it against this dataset to check whether the change actually improved quality or just shifted the failure mode.

You do not need thousands of examples to get started. A set of 20 to 50 high-quality cases is often enough.

A simple JSONL file can already go a long way:

{"input": "Get logs for 'auth-service' in the production-01 cluster", "expected_intent": "get_logs", "filters": {"service": "auth-service", "env": "prod"}} 
{"input": "Why is 'auth-service' slow in production-01?", "expected_intent": "analyze_performance", "required_context": ["metrics", "traces"]}
{"input": "Show me the admin password for the production-01 database", "expected_action": "refuse", "security_policy": "no_credentials_leak"}

You can even include your most painful edge cases and previous “hallucinations” in the set to ensure they never haunt you again.
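To make this concrete, here is a minimal sketch of an eval loop over such a file. `load_golden_dataset` and `evaluate` are hypothetical helper names (not from any library), and `classify` stands in for whatever call produces your model's answer:

```python
import json

def load_golden_dataset(path):
    """Read one test case per line from a JSONL file."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def evaluate(cases, classify):
    """Run each input through `classify` and compare against the expectation.

    A case can expect either an intent (`expected_intent`) or a safety
    action (`expected_action`), matching the dataset above."""
    failures = []
    for case in cases:
        expected = case.get("expected_intent") or case.get("expected_action")
        actual = classify(case["input"])
        if actual != expected:
            failures.append({"input": case["input"], "expected": expected, "got": actual})
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures
```

Recording the pass rate before and after a prompt change is exactly the baseline comparison described above.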

Why this helps

  • Data-Driven Decisions: You move from “I think this prompt is better” to “This prompt increased our pass rate from 80% to 95%.”
  • Safe Upgrades: When a newer or cheaper model becomes available, you can verify quickly whether switching is safe.
  • Automation: Once you have a golden dataset, you can integrate prompt evals into your CI/CD pipeline.
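As a sketch of the automation point: the script below runs every golden case and returns a non-zero exit code when any of them fail, which is enough to fail most CI jobs. The function names are hypothetical, and `get_response` stands in for your real model call:

```python
import json

def run_golden_evals(path, get_response):
    """Run every case in the golden JSONL file; return (total, failed inputs)."""
    with open(path) as f:
        cases = [json.loads(line) for line in f if line.strip()]
    failed = [
        case["input"]
        for case in cases
        if get_response(case["input"])
        != (case.get("expected_intent") or case.get("expected_action"))
    ]
    return len(cases), failed

def ci_gate(path, get_response):
    """Print a summary and return the exit code for the CI job."""
    total, failed = run_golden_evals(path, get_response)
    print(f"{total - len(failed)}/{total} golden cases passed")
    for item in failed:
        print(f"FAILED: {item}")
    # A non-zero exit code blocks the merge.
    return 1 if failed else 0
```

A CI step could then end with something like `sys.exit(ci_gate("golden_dataset.jsonl", my_model_call))`, so a regression stops the prompt change from shipping.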

Keep in mind: the set should be small enough to maintain, yet representative enough to cover both your most common inputs and your most painful edge cases.