One of the most dangerous failure modes in a RAG system is not latency.

It is a confident answer generated without sufficient evidence.

The Problem: Retrieval Fails, Generation Still Happens

Many RAG pipelines are effectively built like this:

Query → Retrieve → Generate.

That is fine when retrieval returns strong evidence.

It becomes dangerous when retrieval returns nothing useful, or only weak matches.

If you still pass that low-quality context into the LLM, the model may try to be helpful and answer anyway. At that point, your system is no longer grounded in your data. It is guessing.

The Solution: Short-Circuiting the Pipeline

Instead of relying on prompt instructions like “Do not hallucinate”, do not call the LLM when retrieval quality is too low.

Implement a short-circuit check immediately after retrieval:

// Example using a hypothetical VectorSearchResult list
var searchResults = await vectorStore.SearchAsync(queryVector, limit: 3);

// Example threshold only.
// In practice, calibrate this on your own eval set.
var minScore = 0.75;

if (!searchResults.Any() || searchResults.Max(r => r.Score) < minScore)
{
    return "I couldn't find reliable information in the provided documents to answer that. Could you please rephrase or provide more context?";
}

// Proceed to the LLM only when retrieval returned quality evidence
var response = await llm.GenerateAsync(query, searchResults);
return response;

Why this helps

  • Reliability: You ensure that the LLM only answers when it has evidence to back it up.
  • Cost efficiency: You save tokens (and thus money) by not sending low-signal prompts to expensive models.
  • Better UX: A transparent “I don’t know” or a request for clarification is infinitely more valuable to a user than a polished lie.

One important caveat

The threshold is not universal.

A score that works in one system may be meaningless in another, depending on your embedding model, similarity metric, vector database, and whether you use hybrid retrieval or reranking.
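To see why the scales are not interchangeable, compare two common scores for the same pair of vectors. A cosine threshold like 0.75 is meaningless on a Euclidean-distance scale, where lower means closer and there is no upper bound. A small illustrative sketch (toy vectors, not real embeddings):

```csharp
using System;
using System.Linq;

float[] a = { 1f, 2f, 3f };
float[] b = { 2f, 3f, 4f };

// Cosine similarity: higher is closer, bounded by [-1, 1].
var dot = a.Zip(b, (x, y) => x * y).Sum();
var cosine = dot / (MathF.Sqrt(a.Sum(x => x * x)) * MathF.Sqrt(b.Sum(x => x * x)));

// Euclidean (L2) distance: lower is closer, unbounded above.
var l2 = MathF.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());
```

Some vector databases even return L2 distance by default, so a `>= threshold` comparison copied from a cosine-based system would be backwards.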

So do not guess the cutoff. Measure it.
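One way to measure it: run retrieval over a small labeled eval set and sweep candidate cutoffs, keeping the one that best separates answerable from unanswerable queries. A minimal sketch, assuming hypothetical `IVectorStore`, `SearchResult`, and `EvalCase` types (none of these names come from a real library):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical retrieval abstraction, mirroring the SearchAsync call above.
public interface IVectorStore
{
    Task<IReadOnlyList<SearchResult>> SearchAsync(float[] queryVector, int limit);
}
public record SearchResult(double Score);

// An eval case: a query vector plus whether the corpus can actually answer it.
public record EvalCase(float[] QueryVector, bool Answerable);

public static class ThresholdCalibrator
{
    public static async Task<double> CalibrateAsync(
        IVectorStore vectorStore, IReadOnlyList<EvalCase> evalSet)
    {
        // Collect the top retrieval score for each eval query.
        var scored = new List<(double TopScore, bool Answerable)>();
        foreach (var c in evalSet)
        {
            var results = await vectorStore.SearchAsync(c.QueryVector, limit: 3);
            var top = results.Any() ? results.Max(r => r.Score) : 0.0;
            scored.Add((top, c.Answerable));
        }

        // Sweep candidate cutoffs; keep the one that best separates the
        // two classes. Plain accuracy here for brevity -- in practice,
        // weight false answers vs. false refusals to match your product.
        double bestCutoff = 0.0, bestAccuracy = 0.0;
        foreach (var cutoff in scored.Select(s => s.TopScore).Distinct())
        {
            var correct = scored.Count(s => (s.TopScore >= cutoff) == s.Answerable);
            var accuracy = (double)correct / scored.Count;
            if (accuracy > bestAccuracy) { (bestCutoff, bestAccuracy) = (cutoff, accuracy); }
        }
        return bestCutoff;
    }
}
```

Re-run the calibration whenever you change the embedding model, chunking, or retrieval stack; the old cutoff will not carry over.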