In the previous article, we looked at RAG for agents. The main point was that retrieval is not magic. It is an application capability that gives the agent selected context at the point where it needs it.
Multimodal inputs hit the same kind of boundary.
Agent Framework can pass non-text content such as images through the agent call. That does not mean every provider, every model, and every file type behaves the same way. The framework gives you a common message model. The provider still decides what it can understand, how it accepts the content, how much it costs, and where the edge cases are.
My rule is simple: send the file to the model when seeing the file is the point. Preprocess it yourself when the product depends on repeatability, scale, indexing, access control, or exact extraction.
Multimodal is message content
In Agent Framework, the basic idea is not complicated. The user message can contain multiple content parts. One part can be text. Another part can be an image URI. Another part can be binary data with a media type.
That shape comes from the Microsoft.Extensions.AI content model.
An agent does not only receive a string.
It can receive a ChatMessage with a list of AIContent items.
A minimal image example looks like this:
    using Microsoft.Agents.AI;
    using Microsoft.Extensions.AI;

    // chatClient is assumed to be an already configured chat client
    // for a vision-capable model.
    AIAgent agent = chatClient.AsAIAgent(
        name: "VisionAgent",
        instructions: """
            You inspect images for engineering-relevant details.
            Be specific about what is visible.
            Say when something cannot be determined from the image.
            """);

    // One user message with two content parts: a text question and an image URI.
    ChatMessage message = new(ChatRole.User, [
        new TextContent("What operational issue is visible in this screenshot?"),
        new UriContent(
            "https://example.com/support-ticket-screenshot.png",
            "image/png")
    ]);

    AgentResponse response = await agent.RunAsync(
        message,
        cancellationToken: cancellationToken);
For a local image, use DataContent instead of a public URL:
    byte[] imageBytes = await File.ReadAllBytesAsync(
        "screenshots/deployment-error.png",
        cancellationToken);

    ChatMessage message = new(ChatRole.User, [
        new TextContent("""
            Read the screenshot and extract:
            - the visible error message
            - the affected service name, if visible
            - the next diagnostic step
            Do not guess hidden text.
            """),
        new DataContent(imageBytes, "image/png")
    ]);
This is the clean part. The application builds a message. The agent runs. The model analyzes the image.
The messy part starts when you ask whether the same code works across all providers, all models, all image sizes, all PDFs, and all deployment environments. It does not.
Framework support is not model support
Agent Framework can represent multimodal content. That is different from guaranteeing that the selected model can process it.
There are three separate questions:
- Can the framework represent the content?
- Can the provider adapter send that content to the provider?
- Can the deployed model understand that content in the way your task requires?
The first answer may be yes while the second or third answer is no.
For example, UriContent can represent hosted content.
DataContent can represent binary content with a media type.
But a text-only model will still not understand an image.
A local provider may require base64 image arrays.
A hosted provider may support image URLs but not private intranet URLs.
A model may read a chart well enough for a summary but poorly enough for exact values.
Treat multimodal support as a capability of the whole route:
Agent Framework message
-> provider adapter
-> provider API
-> deployed model
-> safety filters and limits
-> response shape
If any part of that route does not support the modality, the abstraction cannot fix it.
What works natively
Images are the most straightforward native multimodal input. For vision-capable models, the normal pattern is:
text instruction + image input -> text response
That is useful for:
- screenshot triage
- UI review
- handwritten note transcription
- simple chart explanation
- document page inspection
- comparing a small number of images
- extracting visible fields from a known image type
In an agent application, this usually becomes a focused capability:
User uploads screenshot
-> application validates and stores it
-> agent receives the screenshot plus a specific question
-> agent explains what is visible
-> application logs the input, output, model, and prompt version
Keep the prompt specific. Do not ask the model to “analyze this image” unless a broad description is actually the product. Ask for the fields, risks, differences, or decisions you need.
For example:
Inspect this Azure deployment screenshot.
Return:
1. The visible resource name
2. The visible error or warning
3. Whether the screenshot contains enough information to identify the failing step
4. The next diagnostic command or portal area to check
If a value is not visible, write "not visible".
That instruction gives the model a narrow job. It also gives the rest of your application something predictable to consume.
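One way the application might consume that answer is to parse the four numbered lines into a record and map "not visible" to null. This is a minimal sketch that assumes the model actually follows the numbered format. `ScreenshotFindings` and `ScreenshotFindingsParser` are hypothetical names, and where the provider supports structured output, that is probably the more robust route.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical result type for the four fields requested in the prompt.
public sealed record ScreenshotFindings(
    string? ResourceName,
    string? ErrorOrWarning,
    string? EnoughToIdentifyFailingStep,
    string? NextDiagnosticStep);

public static class ScreenshotFindingsParser
{
    // Matches lines such as "2. Deployment failed" and captures the number and the text.
    private static readonly Regex NumberedLine =
        new(@"^[ \t]*(\d{1,2})[.)][ \t]*(.*)$", RegexOptions.Multiline);

    public static ScreenshotFindings Parse(string modelText)
    {
        var answers = new Dictionary<int, string>();
        foreach (Match match in NumberedLine.Matches(modelText))
        {
            answers[int.Parse(match.Groups[1].Value)] = match.Groups[2].Value;
        }

        return new ScreenshotFindings(
            Get(answers, 1), Get(answers, 2), Get(answers, 3), Get(answers, 4));
    }

    // Missing answers, empty answers, and the literal "not visible" all become null.
    private static string? Get(Dictionary<int, string> answers, int number)
    {
        if (!answers.TryGetValue(number, out var value)) return null;
        var cleaned = value.Trim().Trim('"', '“', '”').Trim();
        if (cleaned.Length == 0) return null;
        return cleaned.Equals("not visible", StringComparison.OrdinalIgnoreCase)
            ? null
            : cleaned;
    }
}
```

The rest of the application can then treat a null field as "the screenshot did not show this" instead of parsing free text.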
PDFs are different
PDFs are more than large images. They are containers.
A PDF may contain:
- embedded text
- scanned page images
- tables
- charts
- vector graphics
- form fields
- annotations
- multiple columns
- rotated pages
- hidden or duplicated text layers
That means “read this PDF” can mean several different things.
For a born-digital PDF, text extraction may be enough. For a scanned PDF, you need OCR or vision. For a report with charts and diagrams, the visual layout may matter. For an invoice, you may need exact field extraction and validation. For a 300-page manual, you probably need ingestion, chunking, retrieval, and citations.
Some provider APIs can accept PDFs directly. For example, OpenAI’s file input flow can put both extracted text and page images into model context for compatible vision models. Claude has PDF support where pages can be processed visually, with platform-specific behavior. Gemini documents PDF processing explicitly and treats PDF pages as visual inputs with its own size, page, and token rules.
That support is useful. It does not remove the need to design the document path.
Native PDF input is a good fit when:
- the document is small enough for one request
- the question is about the whole document or a few known pages
- visual layout matters
- you do not need to index the document for repeated search
- you can tolerate provider-specific behavior
Manual preprocessing is usually better when:
- you process many documents
- documents must be searchable later
- access control matters per document or per section
- extraction must be repeatable
- you need stable citations or page references
- you need validation against business rules
- you need to normalize tables, forms, or fields
- cost and latency must be predictable
For production systems, I would rarely make “send the whole PDF to the agent” the default. I would first decide what kind of document problem I actually have.
Provider differences matter
Provider differences show up in product behavior, not only in request syntax.
Provider docs change quickly. For this post, I checked the June 2026 docs and would still verify these points before shipping:
| Area | What varies | Why it matters |
|---|---|---|
| Supported modalities | Text, image, PDF, audio, video, or only some of them | The same agent design may not work with another model |
| Input transport | URL, base64/data URI, uploaded file ID, provider file API, local model-specific fields | Your storage and upload flow depend on this |
| PDF handling | Text extraction only, visual page processing, or both | Charts, scans, and layout may be ignored in some modes |
| Limits | File size, image count, page count, request payload, resolution, context window | Large files may fail or become expensive |
| Resizing | Providers may resize images before model processing | Small text, coordinates, and fine UI details can become unreliable |
| Cost model | Images and PDF pages often count differently from text | A cheap text flow can become expensive with pages rendered as images |
| Structured output | JSON/schema/tool support may differ with multimodal inputs | Extraction pipelines need tests per provider |
| Data handling | Retention, region, third-party terms, and hosted URL access vary | Enterprise boundaries and compliance reviews depend on this |
The mistake is hiding these differences too early.
A provider-neutral interface is useful for your application. But if the interface pretends that every provider handles images and PDFs the same way, the abstraction will leak in production.
A better abstraction exposes the parts that matter:
    public sealed record MultimodalModelCapabilities(
        bool SupportsImageInput,
        bool SupportsPdfInput,
        bool SupportsImageUrls,
        bool SupportsBinaryContent,
        int? MaxImagesPerRequest,
        long? MaxRequestBytes);
This does not need to be perfect. It just needs to stop the rest of the application from assuming capabilities that are not there.
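A capabilities record like that becomes useful once something checks it before the call. The following is a sketch of such a check, not a complete policy. `PlannedRequest` and `CapabilityGate` are hypothetical names I made up for illustration, and the record is repeated so the example compiles on its own.

```csharp
using System;

// Repeated from the text so the sketch is self-contained.
public sealed record MultimodalModelCapabilities(
    bool SupportsImageInput,
    bool SupportsPdfInput,
    bool SupportsImageUrls,
    bool SupportsBinaryContent,
    int? MaxImagesPerRequest,
    long? MaxRequestBytes);

// Hypothetical description of what the application is about to send.
public sealed record PlannedRequest(
    int ImageCount,
    bool UsesImageUrls,
    bool IncludesPdf,
    long TotalBytes);

public static class CapabilityGate
{
    // Returns null when the request fits the route, otherwise the first reason it does not.
    public static string? FindBlocker(MultimodalModelCapabilities caps, PlannedRequest request)
    {
        if (request.ImageCount > 0 && !caps.SupportsImageInput)
            return "model does not accept image input";
        if (request.ImageCount > 0 && request.UsesImageUrls && !caps.SupportsImageUrls)
            return "route does not accept image URLs";
        if (request.ImageCount > 0 && !request.UsesImageUrls && !caps.SupportsBinaryContent)
            return "route does not accept binary image content";
        if (request.IncludesPdf && !caps.SupportsPdfInput)
            return "model does not accept PDF input";
        if (caps.MaxImagesPerRequest is int maxImages && request.ImageCount > maxImages)
            return $"too many images: {request.ImageCount} > {maxImages}";
        if (caps.MaxRequestBytes is long maxBytes && request.TotalBytes > maxBytes)
            return $"request too large: {request.TotalBytes} > {maxBytes} bytes";
        return null;
    }
}
```

Failing before the provider call gives you a clear application error instead of a provider-specific one.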
Keep the agent away from raw uploads
Do not let the agent decide directly what to do with arbitrary uploaded files.
The application should own the upload boundary:
upload
-> authenticate user
-> authorize document access
-> validate content type
-> scan or reject unsafe files
-> store original file
-> create derived artifacts
-> pass selected artifacts to the agent
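The "validate content type" step in that flow could be sketched as follows. It compares the declared media type with well-known file signatures instead of trusting the upload header. `UploadValidator`, `UploadDecision`, and the 20 MB limit are assumptions for illustration. This covers type and size checks only, not malware scanning or authorization.

```csharp
using System;

public enum UploadDecision { Accept, RejectTooLarge, RejectUnknownType, RejectTypeMismatch }

public static class UploadValidator
{
    // Assumed limit for illustration only; real limits depend on the provider and product.
    public const long MaxBytes = 20L * 1024 * 1024;

    private static readonly byte[] Png = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };
    private static readonly byte[] Jpeg = { 0xFF, 0xD8, 0xFF };
    private static readonly byte[] Pdf = { 0x25, 0x50, 0x44, 0x46, 0x2D }; // "%PDF-"

    public static UploadDecision Validate(string declaredMediaType, byte[] content)
    {
        if (content.LongLength > MaxBytes) return UploadDecision.RejectTooLarge;

        string? sniffed = SniffMediaType(content);
        if (sniffed is null) return UploadDecision.RejectUnknownType;

        return string.Equals(sniffed, declaredMediaType, StringComparison.OrdinalIgnoreCase)
            ? UploadDecision.Accept
            : UploadDecision.RejectTypeMismatch;
    }

    // Returns the media type implied by the leading bytes, or null if unrecognized.
    public static string? SniffMediaType(byte[] content)
    {
        if (StartsWith(content, Png)) return "image/png";
        if (StartsWith(content, Jpeg)) return "image/jpeg";
        if (StartsWith(content, Pdf)) return "application/pdf";
        return null;
    }

    private static bool StartsWith(byte[] content, byte[] signature)
    {
        if (content.Length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (content[i] != signature[i]) return false;
        }
        return true;
    }
}
```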
Derived artifacts might include:
- extracted text
- OCR text
- rendered page images
- thumbnails
- page count
- document metadata
- table extraction output
- classification result
- chunk IDs for retrieval
Then the agent receives only what it needs.
For example:
    public sealed record PreparedAttachment(
        string AttachmentId,
        string MediaType,
        string? ExtractedText,
        IReadOnlyList<PreparedPageImage> PageImages);

    public sealed record PreparedPageImage(
        int PageNumber,
        byte[] ImageBytes,
        string MediaType);
Now the application can decide:
- send only page 3 because retrieval found it
- send extracted text without images
- send a rendered image because the page is scanned
- reject the file because it is too large
- require a specialist parser because it is an invoice
That boundary is boring, but it is the one you want. The alternative is handing the model every uploaded file and hoping it does the right thing.
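To make those decisions concrete, here is a minimal sketch of a selection step over a prepared attachment. The two records are repeated from above so the example compiles on its own. `AttachmentSelection`, `AttachmentSelector`, and the 4 MB image budget are hypothetical, and the "prefer text, else relevant pages" order is only one possible policy.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Repeated from the text so the sketch is self-contained.
public sealed record PreparedPageImage(int PageNumber, byte[] ImageBytes, string MediaType);

public sealed record PreparedAttachment(
    string AttachmentId,
    string MediaType,
    string? ExtractedText,
    IReadOnlyList<PreparedPageImage> PageImages);

// Hypothetical outcome: text, images, or a reason nothing is sent.
public sealed record AttachmentSelection(
    string? Text,
    IReadOnlyList<PreparedPageImage> Images,
    string? RejectionReason);

public static class AttachmentSelector
{
    // Assumed budget for illustration; real limits depend on the provider route.
    public const long MaxImageBytes = 4L * 1024 * 1024;

    public static AttachmentSelection Select(
        PreparedAttachment attachment,
        IReadOnlyCollection<int> relevantPages)
    {
        // Prefer extracted text when the document has a usable text layer.
        if (!string.IsNullOrWhiteSpace(attachment.ExtractedText))
            return new AttachmentSelection(
                attachment.ExtractedText, Array.Empty<PreparedPageImage>(), null);

        // Otherwise send only the page images that retrieval pointed at.
        var images = attachment.PageImages
            .Where(p => relevantPages.Contains(p.PageNumber))
            .ToList();
        if (images.Count == 0)
            return new AttachmentSelection(null, images, "no relevant pages to send");

        long totalBytes = images.Sum(p => (long)p.ImageBytes.Length);
        if (totalBytes > MaxImageBytes)
            return new AttachmentSelection(
                null, Array.Empty<PreparedPageImage>(), "selected pages exceed the image budget");

        return new AttachmentSelection(null, images, null);
    }
}
```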
Native vision is not OCR infrastructure
Vision models can read text in images. That does not make them a full OCR platform.
OCR and document processing systems usually give you things that a general multimodal model may not:
- stable coordinates
- confidence scores
- table structures
- key-value pairs
- page-level metadata
- repeatable extraction
- deterministic post-processing
- specialized document models
- batch processing controls
If your product needs those properties, use a document processing pipeline first. Then give the agent the normalized output.
For example:
invoice PDF
-> document parser extracts fields
-> application validates totals and tax IDs
-> agent explains exceptions to the user
The agent is useful at the explanation layer. It is not the system of record for the invoice total.
The same applies to tables. If you need exact row counts, amounts, IDs, or totals, parse the table with deterministic code where possible. Use the agent to explain the result, handle ambiguous cases, or route exceptions.
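As a sketch of what "deterministic code" can mean here, the check below recomputes an invoice total from parsed line amounts and reports mismatches. The `ParsedInvoice` and `InvoiceLine` shapes are hypothetical, since a real document parser defines its own output types, and real validation would include more rules than these two.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical shape of parser output.
public sealed record InvoiceLine(string Description, decimal Amount);

public sealed record ParsedInvoice(
    IReadOnlyList<InvoiceLine> Lines,
    decimal TaxAmount,
    decimal StatedTotal);

public static class InvoiceValidator
{
    // Returns the exceptions that the agent would later explain to the user.
    public static IReadOnlyList<string> Validate(ParsedInvoice invoice)
    {
        var issues = new List<string>();

        // decimal avoids binary floating-point rounding on currency amounts.
        decimal computed = invoice.Lines.Sum(line => line.Amount) + invoice.TaxAmount;
        if (computed != invoice.StatedTotal)
        {
            issues.Add(string.Format(
                CultureInfo.InvariantCulture,
                "stated total {0} does not match computed total {1}",
                invoice.StatedTotal,
                computed));
        }

        if (invoice.Lines.Any(line => line.Amount < 0))
            issues.Add("invoice contains a negative line amount");

        return issues;
    }
}
```

The agent receives the list of issues, not the job of doing the arithmetic.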
Use tools for document operations
If multimodal work becomes more than a single model call, expose it as a tool.
For example:
    public interface IDocumentInspection
    {
        Task<DocumentInspectionResult> InspectAsync(
            string attachmentId,
            string question,
            UserDocumentScope scope,
            CancellationToken cancellationToken);
    }
The agent does not need direct access to blob storage, PDF rendering, OCR, or every uploaded file. It needs a controlled operation:
    using System.ComponentModel;
    using Microsoft.Extensions.DependencyInjection;

    [Description("Inspects an uploaded document or image that the current user is allowed to access.")]
    public static Task<DocumentInspectionResult> InspectDocumentAsync(
        [Description("The attachment ID shown in the current conversation.")]
        string attachmentId,
        [Description("The specific question to answer about the attachment.")]
        string question,
        IServiceProvider services,
        CancellationToken cancellationToken)
    {
        var userScope = services.GetRequiredService<UserDocumentScope>();
        var inspection = services.GetRequiredService<IDocumentInspection>();

        return inspection.InspectAsync(
            attachmentId,
            question,
            userScope,
            cancellationToken);
    }
This keeps the same design rule from the earlier tool and RAG posts: expose controlled capabilities, not raw infrastructure.
The tool can choose the right path internally:
- call a native multimodal model
- extract text and call a text model
- render selected pages and send images
- call a document intelligence service
- reject unsupported files
- ask the user for a narrower page range
The agent asks for document inspection. The application decides how document inspection works.
Logging matters more with files
Text prompts are already worth logging. File inputs make logging more important.
For multimodal agent calls, log at least:
- model and provider
- prompt version
- attachment ID
- media type
- file size
- page count, if applicable
- selected pages or images sent
- preprocessing path used
- provider request ID, if available
- token usage or image/page usage
- safety or content-filter result
Do not log raw confidential file contents by default. Log identifiers, derived metadata, and enough trace data to reproduce the path with proper access.
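A log entry along those lines might look like the sketch below. The `MultimodalCallLog` record and its field names are hypothetical. The content hash is my own addition, not part of the list above: it is one way to confirm later which file was processed without storing its bytes.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text.Json;

// Hypothetical log record: identifiers and derived metadata, never raw file contents.
public sealed record MultimodalCallLog(
    string Provider,
    string Model,
    string PromptVersion,
    string AttachmentId,
    string MediaType,
    long FileSizeBytes,
    int? PageCount,
    IReadOnlyList<int> PagesSent,
    string PreprocessingPath,
    string? ProviderRequestId,
    string ContentSha256);

public static class MultimodalCallLogging
{
    // Fingerprints the file so the log can identify it without containing it.
    public static string Sha256Hex(byte[] fileBytes) =>
        Convert.ToHexString(SHA256.HashData(fileBytes));

    // Serializes the entry as a single JSON line for a structured log sink.
    public static string ToJsonLine(MultimodalCallLog entry) =>
        JsonSerializer.Serialize(entry);
}
```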
Without this, debugging becomes guesswork. “The agent got the PDF wrong” is not actionable.
You need to know whether:
- the wrong page was selected
- OCR failed
- the model never received the chart
- the provider used text extraction only
- the image was resized too aggressively
- the prompt asked for an exact value the image could not support
- the model guessed instead of saying “not visible”
Multimodal bugs often come from the pipeline around the model.
When I would use native multimodal input
I would use native image or PDF input when:
- the user asks about one image or a small set of images
- the document is short
- visual layout is part of the question
- the task is exploratory or assistive
- exact extraction is not the only success criterion
- provider lock-in is acceptable for that feature
- latency and cost are acceptable after testing with real files
Examples:
- Explain what is wrong in this deployment screenshot.
- Compare these two UI screenshots.
- Summarize the visible changes in this architecture diagram.
- Identify which page in this short PDF contains the approval signature.
- Extract draft fields from a single scanned form for human review.
Native multimodal input is strongest when seeing the artifact is the point.
When I would preprocess manually
I would preprocess manually when:
- documents are long
- documents are processed repeatedly
- results need citations
- extraction needs validation
- fields need to become database records
- users have different access to different files
- the same corpus is queried many times
- cost needs to be predictable
- provider portability matters
Examples:
- Search across a policy library.
- Extract invoice fields into an ERP workflow.
- Answer questions over thousands of manuals.
- Review contracts with section-level citations.
- Compare document versions.
- Validate forms against business rules.
In those cases, use a pipeline:
document
-> extract text and metadata
-> OCR or render pages only when needed
-> chunk and index
-> retrieve relevant sections
-> send selected text/images to the agent
-> validate and store structured results
The agent still matters. It just should not own the entire document pipeline.
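The "chunk and index" step is where stable page references come from. The sketch below is a deliberately naive version: fixed-size character windows that never cross a page boundary, with the page number kept on every chunk. `DocumentChunk`, `PageChunker`, and the chunk ID format are hypothetical, and a real pipeline would more likely split on paragraphs or tokens.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical chunk type: keeps the page number so answers can cite it.
public sealed record DocumentChunk(string ChunkId, int PageNumber, string Text);

public static class PageChunker
{
    // Splits each page's text into windows of at most maxChars characters.
    // pageTexts[0] is treated as page 1.
    public static List<DocumentChunk> Chunk(
        string documentId,
        IReadOnlyList<string> pageTexts,
        int maxChars)
    {
        if (maxChars <= 0) throw new ArgumentOutOfRangeException(nameof(maxChars));

        var chunks = new List<DocumentChunk>();
        for (int page = 0; page < pageTexts.Count; page++)
        {
            string text = pageTexts[page];
            int index = 0;
            for (int start = 0; start < text.Length; start += maxChars)
            {
                int length = Math.Min(maxChars, text.Length - start);
                string piece = text.Substring(start, length).Trim();
                if (piece.Length == 0) continue; // skip whitespace-only windows

                string chunkId = $"{documentId}:p{page + 1}:c{index}";
                chunks.Add(new DocumentChunk(chunkId, page + 1, piece));
                index++;
            }
        }
        return chunks;
    }
}
```

Because each chunk carries its page number, a retrieved chunk can be turned back into a citation or into the one page image worth sending to the model.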
Conclusion
Multimodal agents are useful, but the abstraction boundary is easy to overestimate.
Agent Framework can pass images and other content through the agent message model. The provider and model decide what actually works. PDFs add another layer because they may be text documents, image documents, layout documents, or all three at once.
The design I would carry forward is:
- use native multimodal input for small, visual, interactive tasks
- preprocess files when you need repeatability, scale, search, validation, or access control
- model provider capabilities explicitly
- keep raw uploads behind application-owned boundaries
- use tools for document operations instead of exposing storage directly
- log what happened to the file, not only the final answer
That keeps the agent useful without turning it into a fragile file-processing backend.
In the next post, I will move from files to actions. Human-in-the-loop agents are about the moment where the agent should stop and ask before it does something risky or hard to undo. That includes approval flows, side effects, workflow state, and the part that is easy to miss: human review is a system boundary, not a nicer prompt.
Further reading
- Using images with an Agent Framework agent
- Microsoft.Extensions.AI ChatMessage
- Microsoft.Extensions.AI UriContent
- Microsoft.Extensions.AI DataContent
- Azure OpenAI vision-enabled chat models
- OpenAI image inputs
- OpenAI file inputs for PDFs
- Claude vision
- Claude PDF support
- Gemini document processing
- Ollama vision models