CancellationToken is one of the most underrated AI engineering features in .NET. Not because it is new. Because AI workloads have a different runtime profile.

A normal application call might take milliseconds. An LLM call might take seconds. A streaming response might keep running while tokens are generated. An embedding pipeline might process thousands of chunks. A tool call might trigger another slow network request. And sometimes, the user is already gone. They closed the tab. They navigated away. The HTTP request timed out. The background job was stopped. The deployment is shutting down.

Without cancellation, your application may keep doing expensive work nobody needs anymore. That means:

  • wasted tokens
  • wasted compute
  • unnecessary tool calls
  • slower shutdowns
  • noisy traces
  • connections and memory held longer than necessary

In .NET, this is not a special AI problem. It is a normal engineering problem that becomes much more visible in AI systems.

Pass the CancellationToken.

From your ASP.NET Core endpoint. From HttpContext.RequestAborted. Into your agent call. Into your IChatClient call. Into your embedding generation. Into your retrieval layer. Into your database query. Into your tool execution.

Especially when streaming responses with IAsyncEnumerable, because the UI might stop listening long before your backend stops generating.

AI engineering is not only about better prompts, better models, or better frameworks. It is also about respecting the lifecycle of the request.

Why AI Workloads Expose This Problem

Cancellation exists in .NET because long-running work needs a cooperative way to stop. That has always mattered. But classic CRUD workloads often hide the problem. If a request reads one row from a database and returns in 30 milliseconds, the cost of ignoring cancellation is small. It is still wrong, but it is rarely dramatic.

AI workloads change that. An LLM call can hold an outbound HTTP connection open for seconds. A streaming chat endpoint can continue producing tokens even after the browser tab is closed. A RAG request can do retrieval, reranking, prompt construction, and model generation before the user sees anything useful. An ingestion job can generate embeddings for thousands of chunks. An agent can call tools that call other APIs.

The runtime profile is wider, slower, and more expensive. That is why cancellation stops being a small cleanup detail. It becomes part of the cost and reliability model.

The Mental Model

A CancellationToken is not a timeout button. It is not a thread abort. It does not magically undo side effects. It also does not guarantee that a remote provider stops work or billing instantly. It is a cooperative signal. The caller says: “This work is no longer needed.” The callee decides where it can stop safely.
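
A minimal sketch of what "cooperative" means, using only the base class library. The worker ProcessItemsAsync, the item count, and the progress callback are hypothetical names made up for illustration. The caller signals after the third item, and the worker only stops at the point where it chooses to look at the token.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

using var cts = new CancellationTokenSource();
var completed = 0;
var wasCanceled = false;

try
{
    // The caller loses interest after the third item and says so.
    await ProcessItemsAsync(10, done =>
    {
        completed = done;
        if (done == 3) cts.Cancel();
    }, cts.Token);
}
catch (OperationCanceledException)
{
    wasCanceled = true;
}

Console.WriteLine($"completed={completed}, canceled={wasCanceled}");
// prints "completed=3, canceled=True"

// Hypothetical worker: it decides where stopping is safe.
async Task ProcessItemsAsync(
    int itemCount,
    Action<int> onItemDone,
    CancellationToken cancellationToken)
{
    for (var i = 0; i < itemCount; i++)
    {
        // Safe point: no item is half-processed here.
        cancellationToken.ThrowIfCancellationRequested();

        // Stand-in for real work that also observes the token.
        await Task.Delay(1, cancellationToken);

        onItemDone(i + 1);
    }
}
```

Nothing is rolled back. The three finished items stay finished, which is exactly why the token is a signal and not an undo.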

That distinction matters in AI systems because work often crosses several boundaries:

  1. HTTP request
  2. retrieval
  3. embedding or reranking
  4. model call
  5. streaming response
  6. tool execution
  7. logging and tracing

If the token is dropped at any of these boundaries, everything downstream of that point keeps running after the caller has gone.

The common failure is not that developers forgot cancellation exists. The common failure is that cancellation only exists at the first method signature.

Start at the HTTP Boundary

In ASP.NET Core, a CancellationToken parameter on an endpoint is bound to HttpContext.RequestAborted, the token that fires when the client disconnects or the request is aborted. That is usually the first token you want.

app.MapPost("/ask", async (
    AskRequest request,
    IChatClient chatClient,
    CancellationToken cancellationToken) =>
{
    var messages = new[]
    {
        new ChatMessage(ChatRole.System, "Answer briefly and use the provided context."),
        new ChatMessage(ChatRole.User, request.Question)
    };

    var response = await chatClient.GetResponseAsync(
        messages,
        cancellationToken: cancellationToken);

    return Results.Ok(response.Text);
});

The important part is not the endpoint syntax. The important part is that the token crosses the model boundary.

If the user disconnects while the model is generating, your application should not keep waiting for the answer just so it can throw it away.

In a larger application, I usually pass the token into an application service rather than calling the model directly in the endpoint.

app.MapPost("/ask", async (
    AskRequest request,
    AssistantService assistant,
    CancellationToken cancellationToken) =>
{
    var answer = await assistant.AnswerAsync(
        request.Question,
        cancellationToken);

    return Results.Ok(answer);
});

Then the service owns the AI workflow, but the request still owns the lifecycle.

public sealed class AssistantService(
    IRetrievalService retrieval,
    IChatClient chatClient)
{
    public async Task<string> AnswerAsync(
        string question,
        CancellationToken cancellationToken)
    {
        var context = await retrieval.SearchAsync(
            question,
            cancellationToken);

        var messages = BuildMessages(question, context);

        var response = await chatClient.GetResponseAsync(
            messages,
            cancellationToken: cancellationToken);

        return response.Text;
    }
}

That is the pattern:

  • accept the token at the boundary
  • pass it to every async operation
  • do not replace it with CancellationToken.None
  • do not stop passing it once the code enters the AI layer

Streaming Makes Cancellation More Important

Streaming is where ignored cancellation becomes easiest to miss. The backend can keep generating tokens even after the browser, mobile app, or frontend stream reader is gone. From the user’s perspective, the conversation ended. From the backend’s perspective, the model may still be working. That is wasted work. For streaming endpoints, pass the token into the model call, and optionally into response writes when it helps the write loop exit cleanly.

app.MapGet("/ask/stream", async (
    string question,
    IChatClient chatClient,
    HttpResponse response,
    CancellationToken cancellationToken) =>
{
    response.ContentType = "text/plain; charset=utf-8";

    var messages = new[]
    {
        new ChatMessage(ChatRole.User, question)
    };

    await foreach (var update in chatClient
        .GetStreamingResponseAsync(
            messages,
            cancellationToken: cancellationToken))
    {
        if (string.IsNullOrEmpty(update.Text))
        {
            continue;
        }

        await response.WriteAsync(update.Text, cancellationToken);
        await response.Body.FlushAsync(cancellationToken);
    }
});

There are two useful details here. First, the model streaming call receives the token. Second, each response write receives the same token. That matters because streaming is not one operation. It is a sequence.

Each token chunk is another opportunity to stop.

If the user interface stops listening, the backend should notice. If the request is aborted, the model stream should stop. If the deployment is shutting down, the endpoint should not keep a long stream alive just because it already started.

Passing the token to WriteAsync and FlushAsync is fine, but the more important part is usually upstream cancellation. Once the client disconnects, ASP.NET Core may already stop or fail response writes. The expensive work is the model call, retrieval, tool execution, or embedding generation that keeps producing data for a response nobody will read.

If you consume a streaming API that does not expose a cancellation token parameter, use .WithCancellation(cancellationToken) at the enumeration site instead. When the API does accept a token, passing the same token through both mechanisms is usually redundant, so pick one unless the API documentation explicitly expects both.
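
Here is a small sketch of the enumeration-site variant, assuming C# 9 or later so the attribute can sit on a local function. ReadChunksAsync is a hypothetical stand-in for a provider stream, not a real SDK call. The [EnumeratorCancellation] attribute is what lets the compiler route the WithCancellation token into the iterator.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

using var cts = new CancellationTokenSource();
var received = new List<string>();
var wasCanceled = false;

try
{
    // The call has no token argument, so the token is attached
    // where the stream is enumerated.
    await foreach (var chunk in ReadChunksAsync().WithCancellation(cts.Token))
    {
        received.Add(chunk);

        // Simulates the reader going away after two chunks.
        if (received.Count == 2) cts.Cancel();
    }
}
catch (OperationCanceledException)
{
    wasCanceled = true;
}

Console.WriteLine($"received={received.Count}, canceled={wasCanceled}");

// Hypothetical token stream standing in for a model response.
async IAsyncEnumerable<string> ReadChunksAsync(
    [EnumeratorCancellation] CancellationToken cancellationToken = default)
{
    for (var i = 0; i < 100; i++)
    {
        // Each chunk is another chance to observe the token.
        await Task.Delay(1, cancellationToken);
        yield return $"chunk-{i}";
    }
}
```

The producer stops at its next await after the reader cancels, instead of generating the remaining 98 chunks for nobody.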

RAG Pipelines Need the Same Discipline

RAG systems often hide several expensive operations behind one “ask” endpoint.

A single user question might do this:

  1. rewrite the query
  2. generate an embedding
  3. search a vector index
  4. fetch source documents
  5. rerank results
  6. build the prompt
  7. call the model
  8. stream the answer

Cancellation needs to travel through that whole chain.

public sealed class RagAssistant(
    IQueryRewriter rewriter,
    IRetriever retriever,
    IChatClient chatClient)
{
    public async Task<string> AnswerAsync(
        string question,
        CancellationToken cancellationToken)
    {
        var rewrittenQuery = await rewriter.RewriteAsync(
            question,
            cancellationToken);

        var documents = await retriever.SearchAsync(
            rewrittenQuery,
            cancellationToken);

        var messages = BuildPrompt(question, documents);

        var response = await chatClient.GetResponseAsync(
            messages,
            cancellationToken: cancellationToken);

        return response.Text;
    }
}

This looks boring. That is the point.

Cancellation in AI systems should not require a clever framework. It should be part of the ordinary method contract.

If retrieval is backed by EF Core, pass the token there too.

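// db is assumed to be the application's DbContext, injected into this class.
public Task<List<DocumentChunk>> LoadChunksAsync(
    IReadOnlyCollection<string> chunkIds,
    CancellationToken cancellationToken)
{
    // ToListAsync forwards the token to the underlying database command.
    return db.DocumentChunks
        .Where(chunk => chunkIds.Contains(chunk.Id))
        .ToListAsync(cancellationToken);
}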

If retrieval is backed by a vector database or search service, the same rule applies. The outbound SDK call should receive the token.

Embedding Jobs Are Cancellation Hotspots

Embedding generation is another place where cancellation gets ignored.

It often runs outside the request path, so developers treat it as batch work that can simply run until it finishes. Sometimes that is fine. But ingestion jobs still need to stop cleanly during deployment, shutdown, or operational intervention.

If you process thousands of chunks, check cancellation between batches.

public async Task IndexDocumentsAsync(
    IEnumerable<DocumentChunk> chunks,
    IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator,
    IVectorIndex vectorIndex,
    CancellationToken cancellationToken)
{
    foreach (var batch in chunks.Chunk(64))
    {
        cancellationToken.ThrowIfCancellationRequested();

        var texts = batch
            .Select(chunk => chunk.Text)
            .ToArray();

        var embeddings = await embeddingGenerator.GenerateAsync(
            texts,
            cancellationToken: cancellationToken);

        await vectorIndex.UpsertAsync(
            batch,
            embeddings,
            cancellationToken);
    }
}

The batch boundary is a natural safe point.

There is no need to check the token inside a cheap in-memory projection. The check belongs before the next expensive batch of embeddings.

For ingestion pipelines, cancellation is also useful for shutdown behavior. A background service that ignores its stopping token can make deployments slower and less predictable.

public sealed class EmbeddingWorker(
    EmbeddingQueue queue,
    DocumentIndexer indexer) : BackgroundService
{
    protected override async Task ExecuteAsync(
        CancellationToken stoppingToken)
    {
        await foreach (var job in queue.ReadAllAsync(stoppingToken))
        {
            await indexer.IndexAsync(job, stoppingToken);
        }
    }
}

The stoppingToken is not decoration. It is the host telling your worker that the process is trying to stop.

Tools Need Cancellation Too

Tool calling makes this more important, not less.

An agent tool is still application code. It might query a database, call an internal API, invoke a search service, read files, or trigger another model call. If the parent request is canceled, the tool should not keep doing unnecessary work.

[Description("Searches internal documentation for relevant snippets.")]
public static Task<IReadOnlyList<SearchResult>> SearchDocsAsync(
    string query,
    IServiceProvider services,
    CancellationToken cancellationToken)
{
    var search = services.GetRequiredService<IDocumentSearch>();

    return search.SearchAsync(query, cancellationToken);
}

The model does not need to know about the token. Your application does.

This is an important boundary. Model-supplied arguments are untrusted input. The CancellationToken comes from your runtime. Do not let agent abstractions make you forget that tool execution still belongs to your application lifecycle.

This assumes your tool framework treats IServiceProvider and CancellationToken as runtime-supplied parameters, not model-supplied parameters. If a framework exposes every method parameter to the model schema, do not expose application services or lifecycle tokens that way.
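
To make the boundary concrete, here is a hand-rolled sketch of a tool dispatcher. It is not how any particular agent framework works. The registry, the tool name search_docs, and InvokeToolAsync are all hypothetical. The point is only that the arguments come from the model as JSON, while the token comes from the runtime.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical registry: tool name -> handler(model arguments, runtime token).
var tools = new Dictionary<string, Func<JsonElement, CancellationToken, Task<string>>>
{
    ["search_docs"] = async (args, ct) =>
    {
        // Model-supplied, therefore untrusted: validate before use.
        if (!args.TryGetProperty("query", out var queryElement) ||
            queryElement.ValueKind != JsonValueKind.String ||
            string.IsNullOrWhiteSpace(queryElement.GetString()))
        {
            throw new ArgumentException("The 'query' argument is required.");
        }

        await Task.Delay(1, ct); // stand-in for the real search call
        return $"results for '{queryElement.GetString()}'";
    }
};

using var cts = new CancellationTokenSource();

// The model sends an extra "cancellationToken" field. It is ignored,
// because the real token is never read from model arguments.
var modelArguments = "{\"query\":\"refund policy\",\"cancellationToken\":\"none\"}";
var answer = await InvokeToolAsync("search_docs", modelArguments, cts.Token);

// Once the parent request is canceled, the tool does not start.
cts.Cancel();
var toolWasSkipped = false;
try
{
    await InvokeToolAsync("search_docs", modelArguments, cts.Token);
}
catch (OperationCanceledException)
{
    toolWasSkipped = true;
}

Console.WriteLine($"{answer}, skipped after cancel: {toolWasSkipped}");

async Task<string> InvokeToolAsync(
    string name,
    string modelArgumentsJson,
    CancellationToken cancellationToken)
{
    cancellationToken.ThrowIfCancellationRequested();

    if (!tools.TryGetValue(name, out var handler))
    {
        throw new InvalidOperationException($"Unknown tool '{name}'.");
    }

    using var document = JsonDocument.Parse(modelArgumentsJson);
    return await handler(document.RootElement, cancellationToken);
}
```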

Timeouts Are Policy, Cancellation Is Plumbing

Cancellation tokens are often used to implement timeouts, but they are not the same thing.

A timeout, like a shutdown rule, is a policy decision about when work should stop.

For example:

  • this chat endpoint should stop after 30 seconds
  • this retrieval call should stop after 2 seconds
  • this embedding batch should stop after 5 minutes
  • this background worker should stop when the host shuts down

The token is how that decision travels through the code.

If you need a request cancellation token and an internal timeout, link them at the edge where the policy is visible.

app.MapPost("/ask", async (
    AskRequest request,
    AssistantService assistant,
    CancellationToken requestAborted) =>
{
    using var timeoutCts = new CancellationTokenSource(
        TimeSpan.FromSeconds(30));

    using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(
        requestAborted,
        timeoutCts.Token);

    try
    {
        var answer = await assistant.AnswerAsync(
            request.Question,
            linkedCts.Token);

        return Results.Ok(answer);
    }
    catch (OperationCanceledException) when (requestAborted.IsCancellationRequested)
    {
        // 499 Client Closed Request is a common convention,
        // not an ASP.NET Core named status constant.
        return Results.StatusCode(499);
    }
    catch (OperationCanceledException) when (timeoutCts.IsCancellationRequested)
    {
        return Results.StatusCode(StatusCodes.Status504GatewayTimeout);
    }
});

This keeps two cases separate:

  • the user went away
  • your system decided the operation took too long

Those should not be logged, alerted, or retried in the same way.
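
If several endpoints need the same split, the classification can be pulled into a helper so logs and metrics use consistent labels. This is a sketch under assumptions: RunWithDeadlineAsync and the three label strings are hypothetical, and Task.Delay stands in for the model call.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

var slow = TimeSpan.FromSeconds(5);

// Case 1: the operation outlives the timeout policy.
var timedOut = await RunWithDeadlineAsync(
    ct => Task.Delay(slow, ct),
    TimeSpan.FromMilliseconds(50),
    CancellationToken.None);

// Case 2: the caller is already gone.
using var requestCts = new CancellationTokenSource();
requestCts.Cancel();
var aborted = await RunWithDeadlineAsync(
    ct => Task.Delay(slow, ct),
    TimeSpan.FromSeconds(30),
    requestCts.Token);

// Case 3: the operation finishes in time.
var completed = await RunWithDeadlineAsync(
    ct => Task.Delay(1, ct),
    TimeSpan.FromSeconds(30),
    CancellationToken.None);

Console.WriteLine($"{timedOut}, {aborted}, {completed}");

// Hypothetical helper: runs the operation under a linked token and
// reports which source, if any, ended it.
async Task<string> RunWithDeadlineAsync(
    Func<CancellationToken, Task> operation,
    TimeSpan timeout,
    CancellationToken requestAborted)
{
    using var timeoutCts = new CancellationTokenSource(timeout);
    using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(
        requestAborted,
        timeoutCts.Token);

    try
    {
        await operation(linkedCts.Token);
        return "completed";
    }
    catch (OperationCanceledException) when (requestAborted.IsCancellationRequested)
    {
        return "client_aborted";
    }
    catch (OperationCanceledException) when (timeoutCts.IsCancellationRequested)
    {
        return "timed_out";
    }
}
```

If both sources have fired by the time the exception is caught, the first filter wins and the request is counted as a client abort. Whether that is the right tie-break is a policy choice.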

What Not To Do

The mistake I look for first is CancellationToken.None.

await chatClient.GetResponseAsync(
    messages,
    cancellationToken: CancellationToken.None);

That says: “Even if the caller is gone, keep going.”

Sometimes that is intentional. Most of the time, it is accidental.
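
One intentional case is bookkeeping that must happen even when the caller is gone. A sketch, with hypothetical names (RecordUsageAsync, HandleAsync, the in-memory auditLog): instead of CancellationToken.None, the cleanup gets its own short deadline, so it ignores the aborted request but still cannot run forever.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

var auditLog = new List<string>();

using var requestCts = new CancellationTokenSource();
requestCts.Cancel(); // the caller has already disconnected

var wasCanceled = false;
try
{
    await HandleAsync(requestCts.Token);
}
catch (OperationCanceledException)
{
    wasCanceled = true;
}

Console.WriteLine($"canceled={wasCanceled}, audit entries={auditLog.Count}");

async Task HandleAsync(CancellationToken requestAborted)
{
    try
    {
        // Stand-in for the model call. It observes the request token.
        await Task.Delay(TimeSpan.FromSeconds(5), requestAborted);
    }
    finally
    {
        // Deliberately not the request token: the record must be written
        // even after the caller left. A separate deadline keeps it bounded.
        using var cleanupCts = new CancellationTokenSource(TimeSpan.FromSeconds(2));
        await RecordUsageAsync("request ended", cleanupCts.Token);
    }
}

// Hypothetical bookkeeping step, for example a usage or audit record.
async Task RecordUsageAsync(string entry, CancellationToken cancellationToken)
{
    await Task.Delay(1, cancellationToken);
    auditLog.Add(entry);
}
```

The cancellation still propagates to the caller. Only the bookkeeping is shielded from it, and the reason is written next to the code.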

Another mistake is accepting a token but only using it in the first call.

public async Task<string> AnswerAsync(
    string question,
    CancellationToken cancellationToken)
{
    var documents = await retriever.SearchAsync(question, cancellationToken);

    // Bug: the token is not forwarded, so this call cannot be canceled.
    var response = await chatClient.GetResponseAsync(
        BuildPrompt(question, documents));

    return response.Text;
}

The retrieval call can cancel. The model call cannot.

That is exactly the wrong place to lose the token, because the model call is often the slowest and most expensive part of the operation.

Logging Cancellation Like a Failure Creates Noise

Expected cancellation is not the same as failure.

If a user closes a browser tab while a streaming answer is being generated, that is not a model outage. If the host is shutting down and a background embedding worker stops between batches, that is not an ingestion error.

Log cancellation separately.

try
{
    await assistant.AnswerAsync(question, cancellationToken);
}
catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
{
    logger.LogInformation(
        "AI request canceled before completion.");

    throw;
}

The throw is intentional. The method can add local context, but the boundary should decide how cancellation maps to the transport response, trace status, or job state.

For AI workloads, this matters because traces can get noisy quickly. If every user-aborted stream looks like an application error, your observability gets worse instead of better.

A Checklist for .NET AI Code

When I review AI code in .NET, I check this:

  • Does the ASP.NET Core endpoint accept a CancellationToken?
  • Is HttpContext.RequestAborted used when the token is not injected directly?
  • Does the token reach the agent or IChatClient call?
  • Does streaming use cancellation while consuming IAsyncEnumerable?
  • Does the token reach embedding generation?
  • Does retrieval pass the token into search, vector, and database calls?
  • Do tools accept and pass the token?
  • Do background services use stoppingToken?
  • Are timeout policies explicit and close to the boundary?
  • Are linked token sources disposed?
  • Is CancellationToken.None used only where the reason is intentional?
  • Are expected cancellations logged differently from real failures?

Most of this is not AI-specific syntax. It is ordinary .NET discipline applied to AI runtime behavior.

When To Use Cancellation Tokens

Use cancellation tokens in AI systems when:

  • a user request may be aborted
  • a streaming response may be disconnected
  • a model call may run longer than the user is willing to wait
  • retrieval or reranking has a request-level deadline
  • embedding generation runs in batches
  • tools call databases, APIs, or other slow dependencies
  • background workers need to stop cleanly during deployment
  • retries should stop when the parent operation is no longer relevant
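
The last item is easy to get wrong, because a retry loop has two places that can outlive the caller: the next attempt and the backoff delay. A minimal hand-written sketch, assuming a hypothetical RetryAsync helper and a simulated transient failure. A real system would more likely use a resilience library, but the token rules are the same.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

using var cts = new CancellationTokenSource();
var attempts = 0;
var stoppedByCancellation = false;

try
{
    await RetryAsync<string>(
        ct =>
        {
            attempts++;

            // Simulates the parent request going away during attempt two.
            if (attempts == 2) cts.Cancel();

            throw new TimeoutException("simulated transient failure");
        },
        5,                            // maxAttempts
        TimeSpan.FromMilliseconds(1), // delay between attempts
        cts.Token);
}
catch (OperationCanceledException)
{
    stoppedByCancellation = true;
}

Console.WriteLine($"attempts={attempts}, stopped={stoppedByCancellation}");

// Hypothetical helper: retries transient failures, never cancellation.
async Task<T> RetryAsync<T>(
    Func<CancellationToken, Task<T>> operation,
    int maxAttempts,
    TimeSpan delay,
    CancellationToken cancellationToken)
{
    for (var attempt = 1; ; attempt++)
    {
        // Do not start another attempt for a caller that is gone.
        cancellationToken.ThrowIfCancellationRequested();

        try
        {
            return await operation(cancellationToken);
        }
        catch (Exception ex) when (
            ex is not OperationCanceledException && attempt < maxAttempts)
        {
            // The backoff observes the token too, so waiting ends early.
            await Task.Delay(delay, cancellationToken);
        }
    }
}
```

Without the token, this loop would run all five attempts against a dependency on behalf of a request nobody is waiting for.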

When Not To Rely On Cancellation Tokens

Do not use cancellation tokens as:

  • a substitute for idempotency
  • a substitute for transactions
  • a replacement for retry and circuit-breaker policy
  • a guarantee that remote providers stopped billing instantly
  • a way to pretend side effects can be undone
  • a reason to swallow OperationCanceledException and return success

Cancellation is not the whole reliability story. It is the lifecycle signal that lets the rest of the story behave correctly.

Final Thoughts

The forgotten AI feature in .NET is not flashy. It is probably already in your method signature: CancellationToken. In small CRUD paths, ignoring it might only waste a few milliseconds. In AI systems, ignoring it can waste model calls, tokens, tool executions, embedding batches, and shutdown time.

Better prompts and better models matter. But so does respecting the lifecycle of the request. Pass the token.
