In Python, libraries such as LLMLingua offer ready-made prompt compression. In .NET there is no direct equivalent yet — but we do have the building blocks to implement the same pattern ourselves.
The Problem: The “Token Tax”
Sending 10,000 tokens of retrieved documentation to a premium model on every query increases both cost and latency. Most of that context is boilerplate: HTML tags, redundant headers, repeated navigation, or irrelevant paragraphs.
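To put rough numbers on that claim, here is a back-of-the-envelope calculation. The per-token price and query volume below are illustrative assumptions, not real provider rates:

```csharp
// Back-of-the-envelope cost of the "token tax". All figures are
// illustrative assumptions, not real provider pricing.
const double pricePerMillionInputTokens = 10.00; // assumed premium rate, USD
const int tokensPerQuery = 10_000;
const int queriesPerDay = 5_000;

double dailyCost = (double)tokensPerQuery * queriesPerDay
                   * pricePerMillionInputTokens / 1_000_000;

// If compression strips 70% of the boilerplate, only 30% remains:
double compressedDailyCost = dailyCost * 0.30;

Console.WriteLine($"Before: ${dailyCost:F0}/day, after: ${compressedDailyCost:F0}/day");
// Before: $500/day, after: $150/day
```

Even with toy numbers, trimming boilerplate before it reaches the premium model pays for itself quickly at any real query volume.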
The Solution: Two Architectural Paths
1. The “Cheap Model” Summarizer
Instead of sending raw data to your premium model, use a smaller, cheaper worker model to pre-process the context.
If you use Semantic Kernel, you can pipe your RAG results through a local Phi model via ONNX Runtime GenAI or a smaller hosted model first. Use a prompt like: “Extract only the essential technical facts and identifiers from this context for a RAG system. Remove all prose.”
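A minimal sketch of that pre-processing step against the Microsoft.Extensions.AI abstractions. Here `smallClient` stands in for whichever cheap worker model you wire up (a local Phi model via ONNX Runtime GenAI, a small hosted model, etc.), and the helper name and plumbing are assumptions rather than a prescribed API:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

public static class ContextSummarizer
{
    private const string CompressionPrompt =
        "Extract only the essential technical facts and identifiers " +
        "from this context for a RAG system. Remove all prose.";

    // `smallClient` is any cheap IChatClient (local Phi via ONNX Runtime
    // GenAI, a small hosted model, ...). The premium model only ever
    // sees the compressed result.
    public static async Task<string> CompressAsync(
        IChatClient smallClient,
        string ragContext,
        CancellationToken cancellationToken = default)
    {
        ChatResponse response = await smallClient.GetResponseAsync(
            new ChatMessage[]
            {
                new(ChatRole.System, CompressionPrompt),
                new(ChatRole.User, ragContext),
            },
            cancellationToken: cancellationToken);

        return response.Text;
    }
}
```

The same shape works with Semantic Kernel's chat completion abstractions; the key design point is that the worker call happens before any premium-model request is composed.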
2. The Middleware Pattern
Microsoft.Extensions.AI is a good fit for this pattern because IChatClient supports pipeline-style composition. You can implement a DelegatingChatClient that cleans or compresses context before the request hits the actual model client.
```csharp
using System.Linq;
using System.Text.RegularExpressions;
using Microsoft.Extensions.AI;

public sealed class ContextCompressionChatClient(IChatClient innerClient)
    : DelegatingChatClient(innerClient)
{
    public override async Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        // 1. Strip boilerplate (HTML cleanup, repeated headers, etc.)
        // 2. Filter low-value RAG chunks
        // 3. Optional: call a smaller model to compress the context
        var compressedMessages = CompressContext(messages);

        return await base.GetResponseAsync(
            compressedMessages, options, cancellationToken);
    }

    // Placeholder compression: collapses runs of whitespace in each
    // message's text. A real implementation would strip HTML, dedupe
    // headers, or delegate to a cheaper model as described above.
    private static IEnumerable<ChatMessage> CompressContext(
        IEnumerable<ChatMessage> messages) =>
        messages.Select(m => new ChatMessage(
            m.Role, Regex.Replace(m.Text, @"\s+", " ")));
}
```
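Wiring the decorator into a pipeline might look like the following sketch; `CreateUnderlyingClient()` is a placeholder for however you construct your concrete provider client:

```csharp
using Microsoft.Extensions.AI;

// Placeholder: substitute your real provider client (OpenAI, Azure, Ollama, ...).
IChatClient underlying = CreateUnderlyingClient();

IChatClient client = new ChatClientBuilder(underlying)
    .Use(inner => new ContextCompressionChatClient(inner))
    .Build();

// Every request now passes through the compression middleware
// before it reaches the underlying model client.
ChatResponse response = await client.GetResponseAsync(
    "Summarize the retrieved documentation.");
```

Because the calling code only ever sees `IChatClient`, compression can be added, reordered, or removed without touching business logic.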
Why this helps
| Feature | Why it matters |
|---|---|
| Lower Latency | Fewer input tokens usually mean faster requests and better time-to-first-token. |
| Cost Control | You stop paying premium-model prices for low-value text. |
| Clean Architecture | Your business logic stays prompt-agnostic. Compression happens in the pipeline. |