The Token Trap in Long Chats

As we have seen in previous articles, LLMs are stateless: we have to resend the entire chat history with every request so the AI can retain context.

As each message is added to an ongoing chat, input tokens accumulate. Even after many previous interactions, asking a simple question like “What is 1+1?” still means sending the entire conversation history. This brings problems of its own: the context window eventually fills up, and costs rise with every turn. To address this, the framework introduces Chat Reducers.

Message Counting

The simplest form of a Chat Reducer is “Message Counting”. Here, you define a target count. The reducer keeps the most recent messages up to that count, while preserving the first system message if present.

To use this with an agent, configure a ChatHistoryProvider, such as InMemoryChatHistoryProvider, in ChatClientAgentOptions and pass the reducer through InMemoryChatHistoryProviderOptions.

// 1. Define an IChatReducer that keeps the latest 10 non-system messages
IChatReducer messageCountReducer = new MessageCountingChatReducer(10);

// 2. Configure the agent options with an in-memory chat history provider
var agentOptions = new ChatClientAgentOptions
{
    ChatHistoryProvider = new InMemoryChatHistoryProvider(
        new InMemoryChatHistoryProviderOptions
        {
            ChatReducer = messageCountReducer
        })
};

// 3. Create your agent from an IChatClient
AIAgent agent = chatClient.AsAIAgent(agentOptions);

The major advantage is that token count and latency drop drastically as soon as the limit takes effect, and stay bounded from then on.

A limitation is that earlier context information is no longer available. If you share your name at the start of the conversation and refer to it after messages have been removed, the AI cannot recall it.
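
To see this concretely, here is a minimal usage sketch (assuming the thread-based RunAsync API from the earlier articles in this series; the name and messages are purely illustrative):

// After more than 10 messages, the first turn has been truncated away.
AgentThread thread = agent.GetNewThread();
await agent.RunAsync("Hi, my name is Alice.", thread);
// ... more than 10 further exchanges ...
var reply = await agent.RunAsync("What is my name?", thread);
// The reducer has dropped the first message, so the model cannot answer reliably.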

Summarization

A more sophisticated approach is the SummarizingChatReducer. This method uses an IChatClient to summarize older messages during reduction.

To set it up, you define the target count and an optional threshold. The target count is the number of recent messages that should remain after the reduction. The threshold controls how many messages beyond that target count are allowed before summarization is triggered.

When the conversation grows beyond targetCount + threshold, the reducer summarizes the older messages. The summary replaces them, while the most recent chat messages remain unchanged. For example, with targetCount: 1 and threshold: 9 (as in the code below), the history is left untouched until it exceeds 10 messages; then everything except the most recent message is condensed into a summary.

A key feature for advanced scenarios is prompt customization. Via the SummarizationPrompt property, you can tailor the summarization prompt to your application’s domain, highlight specific information, or enforce a particular writing style, resulting in summaries that are more useful and relevant for your use case.

// 1. You need a base IChatClient to perform the summarization calls
IChatClient innerChatClient = chatClient; // e.g., Azure OpenAI, OpenAI, or Ollama
// 2. Configure the reducer
// This keeps 1 recent message after summarization.
// threshold is "messages allowed beyond targetCount", so 9 means summarization
// starts once the history grows beyond 10.
IChatReducer summaryReducer = new SummarizingChatReducer(
    chatClient: innerChatClient,
    targetCount: 1,
    threshold: 9)
{
    SummarizationPrompt =
        "Summarize the following conversation while keeping technical specs and user names."
};
// 3. Configure the agent options with the reducer
var summaryAgentOptions = new ChatClientAgentOptions
{
    ChatHistoryProvider = new InMemoryChatHistoryProvider(
        new InMemoryChatHistoryProviderOptions
        {
            ChatReducer = summaryReducer
        })
};
// 4. Create the agent
AIAgent smartAgent = chatClient.AsAIAgent(summaryAgentOptions);

A significant benefit is that details from earlier in the conversation, such as your name or instructions, are included in the summary, allowing the AI to retain relevant information.
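
Continuing the earlier sketch (same assumed thread-based API), the same experiment now succeeds, because the name survives inside the generated summary:

AgentThread smartThread = smartAgent.GetNewThread();
await smartAgent.RunAsync("Hi, my name is Alice.", smartThread);
// ... many further exchanges; older turns are folded into a summary ...
var recall = await smartAgent.RunAsync("What is my name?", smartThread);
// The summary preserves "Alice", so the model can recall it.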

The disadvantage is that generating the summary itself costs tokens. Summarization also introduces a performance impact: the agent must pause and wait for the model to return the summary before proceeding, which temporarily increases the latency of the user’s next message each time summarization is triggered. In high-traffic scenarios, frequent summarizations may also affect overall throughput, so weigh these trade-offs and test your reducer settings under expected workloads to ensure performance stays within acceptable limits.

Tip: To keep costs and latency low, you don’t have to use your powerful main model for summarization. You can pass a smaller, faster model as the innerChatClient.
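
For example (a sketch assuming the OpenAI provider package for Microsoft.Extensions.AI; the model name is purely illustrative), the summarization calls can go to a cheaper model while the agent itself keeps running on your main chatClient:

// Hypothetical setup: a small, fast model used only for summarization.
IChatClient summaryModel =
    new OpenAIClient(Environment.GetEnvironmentVariable("OPENAI_API_KEY")!)
        .GetChatClient("gpt-4o-mini") // example model name; substitute your own
        .AsIChatClient();

IChatReducer budgetReducer = new SummarizingChatReducer(
    chatClient: summaryModel, // the cheap model writes the summaries
    targetCount: 1,
    threshold: 9);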

Note: The framework doesn’t provide an automatic fallback if summarization fails. A robust implementation should include a retry policy (via the IChatClient pipeline) or a custom mechanism to retain recent messages, ensuring the conversation remains fluid even in the event of, e.g., an API error.
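
A minimal sketch of such a retry mechanism, built on the DelegatingChatClient base class from Microsoft.Extensions.AI (attempt count and backoff are illustrative, and the streaming path is omitted for brevity):

// Illustrative retry decorator for the summarization client.
public sealed class RetryingChatClient(IChatClient innerClient, int maxAttempts = 3)
    : DelegatingChatClient(innerClient)
{
    public override async Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await base.GetResponseAsync(messages, options, cancellationToken);
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Simple linear backoff before retrying the call.
                await Task.Delay(TimeSpan.FromSeconds(attempt), cancellationToken);
            }
        }
    }
}

// Wrap the summarization client before handing it to the reducer:
IChatClient resilientClient = new RetryingChatClient(innerChatClient);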

Practical Comparison

Which reducer you choose depends heavily on your specific use case.

It is always a balancing act between the value of retaining old messages, the cost of tokens, and the model’s maximum context size.

Use pure truncation (Message Counting) for simple use cases where old topics quickly become irrelevant.

Use Summarization for complex, in-depth agents where the user might still want to refer back to earlier facts even after 15 minutes of chatting.

Feature    | Message Counting (Truncation)    | Summarization
Best For   | Simple bots, high-volume support | Complex assistants, deep analysis
Context    | Lost once it drops off the list  | Retained in condensed form
Token Cost | Lowest (zero cost for reduction) | Moderate (costs tokens to summarize)
Complexity | Set and forget                   | Requires custom prompts & error handling

Conclusion

Chat Reducers let us control conversation length and token costs efficiently.

Next, we’ll explore AIContextProviders, which allow agents to dynamically inject context and extract new memories, providing persistent memory while optimizing token usage.

Further Reading