In Python, libraries such as LLMLingua offer ready-made prompt compression. In .NET there is no direct equivalent yet — but we do have the building blocks to implement the same pattern ourselves.
The Problem: The “Token Tax”
Sending 10,000 tokens of retrieved documentation to a premium model on every query increases both cost and latency. Most of that context is boilerplate: HTML tags, redundant headers, repeated navigation, or irrelevant paragraphs.
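To put rough numbers on that claim, here is a back-of-the-envelope calculation. The per-token price and query volume below are illustrative assumptions, not real provider rates:

```csharp
// Back-of-the-envelope cost of the "token tax". All figures are
// illustrative assumptions, not real provider pricing.
const double pricePerMillionInputTokens = 10.00; // assumed premium rate, USD
const int tokensPerQuery = 10_000;
const int queriesPerDay = 5_000;

double dailyCost = (double)tokensPerQuery * queriesPerDay
                   * pricePerMillionInputTokens / 1_000_000;

// If compression strips 70% of the boilerplate, only 30% remains:
double compressedDailyCost = dailyCost * 0.30;

Console.WriteLine($"Before: ${dailyCost:F0}/day, after: ${compressedDailyCost:F0}/day");
// Before: $500/day, after: $150/day
```

Even with toy numbers, trimming boilerplate before it reaches the premium model pays for itself quickly at any real query volume.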
The Solution: Two Architectural Paths
1. The “Cheap Model” Summarizer
Instead of sending raw data to your premium model, use a smaller, cheaper worker model to pre-process the context.
If you use Semantic Kernel, you can pipe your RAG results through a local Phi model via ONNX Runtime GenAI or a smaller hosted model first. Use a prompt like: “Extract only the essential technical facts and identifiers from this context for a RAG system. Remove all prose.”
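A minimal sketch of that pre-processing step against the Microsoft.Extensions.AI abstractions. Here `smallClient` stands in for whichever cheap worker model you wire up (a local Phi model via ONNX Runtime GenAI, a small hosted model, etc.), and the helper name and plumbing are assumptions rather than a prescribed API:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

public static class ContextSummarizer
{
    private const string CompressionPrompt =
        "Extract only the essential technical facts and identifiers " +
        "from this context for a RAG system. Remove all prose.";

    // `smallClient` is any cheap IChatClient (local Phi via ONNX Runtime
    // GenAI, a small hosted model, ...). The premium model only ever
    // sees the compressed result.
    public static async Task<string> CompressAsync(
        IChatClient smallClient,
        string ragContext,
        CancellationToken cancellationToken = default)
    {
        ChatResponse response = await smallClient.GetResponseAsync(
            new ChatMessage[]
            {
                new(ChatRole.System, CompressionPrompt),
                new(ChatRole.User, ragContext),
            },
            cancellationToken: cancellationToken);

        return response.Text;
    }
}
```

The same shape works with Semantic Kernel's chat completion abstractions; the key design point is that the worker call happens before any premium-model request is composed.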
2. The Middleware Pattern
Microsoft.Extensions.AI is a good fit for this pattern because IChatClient supports pipeline-style composition. You can implement a DelegatingChatClient that cleans or compresses context before the request hits the actual model client.
```csharp
using System.Linq;
using System.Text.RegularExpressions;
using Microsoft.Extensions.AI;

public sealed class ContextCompressionChatClient(IChatClient innerClient)
    : DelegatingChatClient(innerClient)
{
    public override async Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        // 1. Strip boilerplate (HTML cleanup, repeated headers, etc.)
        // 2. Filter low-value RAG chunks
        // 3. Optional: call a smaller model to compress the context
        var compressedMessages = CompressContext(messages);

        return await base.GetResponseAsync(
            compressedMessages, options, cancellationToken);
    }

    // Placeholder compression: collapses runs of whitespace in each
    // message's text. A real implementation would strip HTML, dedupe
    // headers, or delegate to a cheaper model as described above.
    private static IEnumerable<ChatMessage> CompressContext(
        IEnumerable<ChatMessage> messages) =>
        messages.Select(m => new ChatMessage(
            m.Role, Regex.Replace(m.Text, @"\s+", " ")));
}
```
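Wiring the decorator into a pipeline might look like the following sketch; `CreateUnderlyingClient()` is a placeholder for however you construct your concrete provider client:

```csharp
using Microsoft.Extensions.AI;

// Placeholder: substitute your real provider client (OpenAI, Azure, Ollama, ...).
IChatClient underlying = CreateUnderlyingClient();

IChatClient client = new ChatClientBuilder(underlying)
    .Use(inner => new ContextCompressionChatClient(inner))
    .Build();

// Every request now passes through the compression middleware
// before it reaches the underlying model client.
ChatResponse response = await client.GetResponseAsync(
    "Summarize the retrieved documentation.");
```

Because the calling code only ever sees `IChatClient`, compression can be added, reordered, or removed without touching business logic.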
Why this helps
| Feature | Why it matters |
|---|---|
| Lower Latency | Fewer input tokens usually mean faster requests and better time-to-first-token. |
| Cost Control | You stop paying premium-model prices for low-value text. |
| Clean Architecture | Your business logic stays prompt-agnostic. Compression happens in the pipeline. |