Microsoft Agent Framework Multimodal Agents: Images, PDFs, and Provider Differences

In the previous article, we looked at RAG for agents. The main point was that retrieval is not magic. It is an application capability that gives the agent selected context at the point where it needs it.

Multimodal inputs hit the same kind of boundary.

Agent Framework can pass non-text content such as images through the agent call. That does not mean every provider, every model, and every file type behaves the same way. The framework gives you a common message model. The provider still decides what it can understand, how it accepts the content, how much it costs, and where the edge cases are.

My rule is simple: send the file to the model when seeing the file is the point. Preprocess it yourself when the product depends on repeatability, scale, indexing, access control, or exact extraction.

Multimodal is message content

In Agent Framework, the basic idea is not complicated. The user message can contain multiple content parts. One part can be text. Another part can be an image URI. Another part can be binary data with a media type.

That shape comes from the Microsoft.Extensions.AI content model. An agent does not only receive a string. It can receive a ChatMessage with a list of AIContent items.

A minimal image example looks like this:

using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;

// chatClient is an IChatClient for a vision-capable model, configured elsewhere.
AIAgent agent = chatClient.AsAIAgent(
    name: "VisionAgent",
    instructions: """
    You inspect images for engineering-relevant details.
    Be specific about what is visible.
    Say when something cannot be determined from the image.
    """);

// One user message with two content parts: the question and the image reference.
ChatMessage message = new(ChatRole.User, [
    new TextContent("What operational issue is visible in this screenshot?"),
    new UriContent(
        "https://example.com/support-ticket-screenshot.png",
        "image/png")
]);

// cancellationToken comes from the surrounding request or host.
AgentResponse response = await agent.RunAsync(
    message,
    cancellationToken: cancellationToken);

For a local image, use DataContent instead of a public URL:

byte[] imageBytes = await File.ReadAllBytesAsync(
    "screenshots/deployment-error.png",
    cancellationToken);

ChatMessage message = new(ChatRole.User, [
    new TextContent("""
    Read the screenshot and extract:
    - the visible error message
    - the affected service name, if visible
    - the next diagnostic step

    Do not guess hidden text.
    """),
    new DataContent(imageBytes, "image/png")
]);

This is the clean part. The application builds a message. The agent runs. The model analyzes the image.

The messy part starts when you ask whether the same code works across all providers, all models, all image sizes, all PDFs, and all deployment environments. It does not.

Framework support is not model support

Agent Framework can represent multimodal content. That is different from guaranteeing that the selected model can process it.

There are three separate questions:

Can the framework represent the content?
Can the provider adapter send that content to the provider?
Can the deployed model understand that content in the way your task requires?

The first answer may be yes while the second or third answer is no.

For example, UriContent can represent hosted content. DataContent can represent binary content with a media type. But a text-only model will still not understand an image. A local provider may require base64 image arrays. A hosted provider may support image URLs but not private intranet URLs. A model may read a chart well enough for a summary but poorly enough for exact values.

Treat multimodal support as a capability of the whole route:

Agent Framework message
-> provider adapter
-> provider API
-> deployed model
-> safety filters and limits
-> response shape

If any part of that route does not support the modality, the abstraction cannot fix it.

What works natively

Images are the most straightforward native multimodal input. For vision-capable models, the normal pattern is:

text instruction + image input -> text response

That is useful for:

screenshot triage
UI review
handwritten note transcription
simple chart explanation
document page inspection
comparing a small number of images
extracting visible fields from a known image type

In an agent application, this usually becomes a focused capability:

User uploads screenshot
-> application validates and stores it
-> agent receives the screenshot plus a specific question
-> agent explains what is visible
-> application logs the input, output, model, and prompt version

Keep the prompt specific. Do not ask the model to “analyze this image” unless a broad description is actually the product. Ask for the fields, risks, differences, or decisions you need.

For example:

Inspect this Azure deployment screenshot.

Return:
1. The visible resource name
2. The visible error or warning
3. Whether the screenshot contains enough information to identify the failing step
4. The next diagnostic command or portal area to check

If a value is not visible, write "not visible".

That instruction gives the model a narrow job. It also gives the rest of your application something predictable to consume.
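
If the reply follows that numbered shape, the consuming side can stay small. Here is a minimal sketch, assuming the model answers with one numbered item per line; `NumberedAnswerParser` is a hypothetical helper, not part of Agent Framework. Schema-based structured output is usually the sturdier option where the provider supports it with image inputs.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NumberedAnswerParser
{
    // Matches lines such as "1. value" or "2) value".
    private static readonly Regex ItemPattern =
        new(@"^\s*(\d+)[.)]\s*(.+?)\s*$", RegexOptions.Multiline);

    // Maps item number -> value, with null for items the model marked "not visible".
    public static IReadOnlyDictionary<int, string?> Parse(string response)
    {
        var result = new Dictionary<int, string?>();

        foreach (Match match in ItemPattern.Matches(response))
        {
            int number = int.Parse(match.Groups[1].Value);
            string value = match.Groups[2].Value.Trim('"');

            result[number] = value.Equals("not visible", StringComparison.OrdinalIgnoreCase)
                ? null
                : value;
        }

        return result;
    }
}
```

A null entry is then an explicit signal for the application: ask the user for a better screenshot instead of acting on a guess.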

PDFs are different

PDFs are more than large images. They are containers.

A PDF may contain:

embedded text
scanned page images
tables
charts
vector graphics
form fields
annotations
multiple columns
rotated pages
hidden or duplicated text layers

That means “read this PDF” can mean several different things.

For a born-digital PDF, text extraction may be enough. For a scanned PDF, you need OCR or vision. For a report with charts and diagrams, the visual layout may matter. For an invoice, you may need exact field extraction and validation. For a 300-page manual, you probably need ingestion, chunking, retrieval, and citations.

Some provider APIs can accept PDFs directly. For example, OpenAI’s file input flow can put both extracted text and page images into model context for compatible vision models. Claude has PDF support where pages can be processed visually, with platform-specific behavior. Gemini documents PDF processing explicitly and treats PDF pages as visual inputs with its own size, page, and token rules.

That support is useful. It does not remove the need to design the document path.

Native PDF input is a good fit when:

the document is small enough for one request
the question is about the whole document or a few known pages
visual layout matters
you do not need to index the document for repeated search
you can tolerate provider-specific behavior

Manual preprocessing is usually better when:

you process many documents
documents must be searchable later
access control matters per document or per section
extraction must be repeatable
you need stable citations or page references
you need validation against business rules
you need to normalize tables, forms, or fields
cost and latency must be predictable

For production systems, I would rarely make “send the whole PDF to the agent” the default. I would first decide what kind of document problem I actually have.
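
That decision can be written down as code instead of living in someone's head. The following is a rough sketch, not a recommendation of specific rules: `PdfProfile`, `DocumentPath`, `DocumentRouter`, and the page threshold are all assumptions made for illustration.

```csharp
public enum DocumentPath
{
    NativeInput,        // send the file or its pages straight to a multimodal model
    TextExtraction,     // extract embedded text and use a text model
    OcrOrVision,        // scanned pages: OCR them or render them as images
    SpecialistParser,   // invoices and forms that need exact, validated fields
    IngestAndRetrieve   // long or reused documents: chunk, index, retrieve, cite
}

// Hypothetical facts the upload pipeline has already worked out about a PDF.
public sealed record PdfProfile(
    int PageCount,
    bool HasTextLayer,
    bool LayoutMatters,
    bool NeedsExactFields,
    bool QueriedRepeatedly);

public static class DocumentRouter
{
    // Illustrative threshold only; tune it against real files and provider limits.
    private const int MaxPagesForSingleRequest = 20;

    public static DocumentPath Choose(PdfProfile pdf)
    {
        if (pdf.NeedsExactFields) return DocumentPath.SpecialistParser;

        if (pdf.QueriedRepeatedly || pdf.PageCount > MaxPagesForSingleRequest)
            return DocumentPath.IngestAndRetrieve;

        if (pdf.LayoutMatters) return DocumentPath.NativeInput;
        if (!pdf.HasTextLayer) return DocumentPath.OcrOrVision;

        return DocumentPath.TextExtraction;
    }
}
```

The order of the checks is the design choice here: exactness and scale win over convenience, and native input is only chosen when layout is actually part of the question.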

Provider differences matter

Provider differences show up in product behavior, not only in request syntax.

Provider docs change quickly. For this post, I checked the June 2026 docs and would still verify these points before shipping:

Area	What varies	Why it matters
Supported modalities	Text, image, PDF, audio, video, or only some of them	The same agent design may not work with another model
Input transport	URL, base64/data URI, uploaded file ID, provider file API, local model-specific fields	Your storage and upload flow depend on this
PDF handling	Text extraction only, visual page processing, or both	Charts, scans, and layout may be ignored in some modes
Limits	File size, image count, page count, request payload, resolution, context window	Large files may fail or become expensive
Resizing	Providers may resize images before model processing	Small text, coordinates, and fine UI details can become unreliable
Cost model	Images and PDF pages often count differently from text	A cheap text flow can become expensive with pages rendered as images
Structured output	JSON/schema/tool support may differ with multimodal inputs	Extraction pipelines need tests per provider
Data handling	Retention, region, third-party terms, and hosted URL access vary	Enterprise boundaries and compliance reviews depend on this

The mistake is hiding these differences too early.

A provider-neutral interface is useful for your application. But if the interface pretends that every provider handles images and PDFs the same way, that abstraction will leak in production.

A better abstraction exposes the parts that matter:

public sealed record MultimodalModelCapabilities(
    bool SupportsImageInput,
    bool SupportsPdfInput,
    bool SupportsImageUrls,
    bool SupportsBinaryContent,
    int? MaxImagesPerRequest,
    long? MaxRequestBytes);

This does not need to be perfect. It just needs to stop the rest of the application from assuming capabilities that are not there.
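
One way to use such a record is a guard that runs before the agent call. This is a sketch under assumptions: `AttachmentRequest` and `CapabilityGuard` are hypothetical names, and the capability values would have to come from your own configuration or provider documentation, not from Agent Framework.

```csharp
using System.Collections.Generic;

// Same shape as the record above, repeated so the sketch compiles on its own.
public sealed record MultimodalModelCapabilities(
    bool SupportsImageInput,
    bool SupportsPdfInput,
    bool SupportsImageUrls,
    bool SupportsBinaryContent,
    int? MaxImagesPerRequest,
    long? MaxRequestBytes);

// Hypothetical description of what the application is about to send.
public sealed record AttachmentRequest(
    int ImageCount,
    bool ContainsPdf,
    bool UsesImageUrls,
    long TotalBytes);

public static class CapabilityGuard
{
    // Returns the reasons the request cannot be sent; an empty list means it may proceed.
    public static IReadOnlyList<string> Check(
        MultimodalModelCapabilities caps,
        AttachmentRequest request)
    {
        var problems = new List<string>();

        if (request.ImageCount > 0 && !caps.SupportsImageInput)
            problems.Add("Model does not accept image input.");
        if (request.ContainsPdf && !caps.SupportsPdfInput)
            problems.Add("Model does not accept PDF input.");
        if (request.UsesImageUrls && !caps.SupportsImageUrls)
            problems.Add("Provider does not accept image URLs; send bytes instead.");
        if (request.ImageCount > 0 && !request.UsesImageUrls && !caps.SupportsBinaryContent)
            problems.Add("Provider does not accept binary image content.");
        if (caps.MaxImagesPerRequest is int maxImages && request.ImageCount > maxImages)
            problems.Add($"Too many images: {request.ImageCount} > {maxImages}.");
        if (caps.MaxRequestBytes is long maxBytes && request.TotalBytes > maxBytes)
            problems.Add($"Request too large: {request.TotalBytes} > {maxBytes} bytes.");

        return problems;
    }
}
```

Failing here, with a readable reason, is cheaper than discovering the same limit as a provider error after the upload has already been sent.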

Keep the agent away from raw uploads

Do not let the agent decide what to do with arbitrary uploaded files directly.

The application should own the upload boundary:

upload
-> authenticate user
-> authorize document access
-> validate content type
-> scan or reject unsafe files
-> store original file
-> create derived artifacts
-> pass selected artifacts to the agent
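
The "validate content type" step is a good place to be strict. A minimal sketch, assuming an allowlist of three media types and a 20 MB limit (both illustrative), might compare the declared media type against the file's leading bytes. `UploadValidator` is a hypothetical helper, and this check does not replace malware scanning.

```csharp
using System;
using System.Collections.Generic;

public static class UploadValidator
{
    // Illustrative limit; real values depend on the product and the provider.
    private const long MaxBytes = 20 * 1024 * 1024;

    // Allowed media types and the file signatures they must start with.
    private static readonly Dictionary<string, byte[]> Signatures = new()
    {
        ["image/png"] = new byte[] { 0x89, 0x50, 0x4E, 0x47 },       // \x89PNG
        ["image/jpeg"] = new byte[] { 0xFF, 0xD8, 0xFF },
        ["application/pdf"] = new byte[] { 0x25, 0x50, 0x44, 0x46 }  // %PDF
    };

    // Returns null when the upload passes, otherwise a rejection reason.
    public static string? Validate(string declaredMediaType, ReadOnlySpan<byte> content)
    {
        if (content.Length == 0) return "File is empty.";
        if (content.Length > MaxBytes) return "File is too large.";

        if (!Signatures.TryGetValue(declaredMediaType, out var signature))
            return $"Media type '{declaredMediaType}' is not allowed.";

        ReadOnlySpan<byte> expected = signature;
        if (!content.StartsWith(expected))
            return "File content does not match the declared media type.";

        return null;
    }
}
```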

Derived artifacts might include:

extracted text
OCR text
rendered page images
thumbnails
page count
document metadata
table extraction output
classification result
chunk IDs for retrieval

Then the agent receives only what it needs.

For example:

public sealed record PreparedAttachment(
    string AttachmentId,
    string MediaType,
    string? ExtractedText,
    IReadOnlyList<PreparedPageImage> PageImages);

public sealed record PreparedPageImage(
    int PageNumber,
    byte[] ImageBytes,
    string MediaType);

Now the application can decide:

send only page 3 because retrieval found it
send extracted text without images
send a rendered image because the page is scanned
reject the file because it is too large
require a specialist parser because it is an invoice

That boundary is boring, but it is the one you want. The alternative is handing the model every uploaded file and hoping it does the right thing.
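
To make "pass selected artifacts" concrete, here is a sketch that turns a prepared attachment into a single user message. It assumes the Microsoft.Extensions.AI package and reuses the `TextContent` and `DataContent` types shown earlier; `AttachmentMessageBuilder` and the `selectedPages` parameter are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.Extensions.AI;

// Same shapes as the records above, repeated so the sketch compiles on its own.
public sealed record PreparedPageImage(int PageNumber, byte[] ImageBytes, string MediaType);

public sealed record PreparedAttachment(
    string AttachmentId,
    string MediaType,
    string? ExtractedText,
    IReadOnlyList<PreparedPageImage> PageImages);

public static class AttachmentMessageBuilder
{
    // Builds a user message from the question, the extracted text (if any),
    // and only the page images whose numbers are in selectedPages.
    public static ChatMessage Build(
        string question,
        PreparedAttachment attachment,
        IReadOnlySet<int> selectedPages)
    {
        var contents = new List<AIContent> { new TextContent(question) };

        if (!string.IsNullOrWhiteSpace(attachment.ExtractedText))
            contents.Add(new TextContent(attachment.ExtractedText));

        foreach (var page in attachment.PageImages
                     .Where(p => selectedPages.Contains(p.PageNumber))
                     .OrderBy(p => p.PageNumber))
        {
            contents.Add(new DataContent(page.ImageBytes, page.MediaType));
        }

        return new ChatMessage(ChatRole.User, contents);
    }
}
```

The agent never sees pages that were not selected, which is the point: selection is an application decision, made before the model is involved.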

Native vision is not OCR infrastructure

Vision models can read text in images. That does not make them a full OCR platform.

OCR and document processing systems usually give you things that a general multimodal model may not:

stable coordinates
confidence scores
table structures
key-value pairs
page-level metadata
repeatable extraction
deterministic post-processing
specialized document models
batch processing controls

If your product needs those properties, use a document processing pipeline first. Then give the agent the normalized output.

For example:

invoice PDF
-> document parser extracts fields
-> application validates totals and tax IDs
-> agent explains exceptions to the user

The agent is useful at the explanation layer. It is not the system of record for the invoice total.

The same applies to tables. If you need exact row counts, amounts, IDs, or totals, parse the table with deterministic code where possible. Use the agent to explain the result, handle ambiguous cases, or route exceptions.
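
The "validate totals" step is ordinary arithmetic, so it belongs in ordinary code. A small sketch, assuming the parser has already produced typed values; `ParsedInvoice`, `InvoiceLine`, and `InvoiceValidator` are hypothetical names, and real invoices would need currency and rounding rules on top of this.

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed record InvoiceLine(string Description, decimal Amount);

public sealed record ParsedInvoice(
    IReadOnlyList<InvoiceLine> Lines,
    decimal TaxAmount,
    decimal StatedTotal);

public static class InvoiceValidator
{
    // Returns the exceptions the agent should explain; empty means the totals reconcile.
    public static IReadOnlyList<string> Validate(ParsedInvoice invoice)
    {
        var issues = new List<string>();

        if (invoice.Lines.Count == 0)
            issues.Add("No line items were extracted.");

        // decimal avoids the binary rounding errors that double would introduce.
        decimal computed = invoice.Lines.Sum(line => line.Amount) + invoice.TaxAmount;

        if (computed != invoice.StatedTotal)
            issues.Add(
                $"Stated total {invoice.StatedTotal} does not match computed total {computed}.");

        return issues;
    }
}
```

The agent then receives the list of issues, not the raw PDF, and its job is to explain them to the user.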

Use tools for document operations

If multimodal work becomes more than a single model call, expose it as a tool.

For example:

public interface IDocumentInspection
{
    Task<DocumentInspectionResult> InspectAsync(
        string attachmentId,
        string question,
        UserDocumentScope scope,
        CancellationToken cancellationToken);
}

The agent does not need direct access to blob storage, PDF rendering, OCR, or every uploaded file. It needs a controlled operation:

using System.ComponentModel;
using Microsoft.Extensions.DependencyInjection;

[Description("Inspects an uploaded document or image that the current user is allowed to access.")]
public static Task<DocumentInspectionResult> InspectDocumentAsync(
    [Description("The attachment ID shown in the current conversation.")]
    string attachmentId,
    [Description("The specific question to answer about the attachment.")]
    string question,
    IServiceProvider services,
    CancellationToken cancellationToken)
{
    var userScope = services.GetRequiredService<UserDocumentScope>();
    var inspection = services.GetRequiredService<IDocumentInspection>();

    return inspection.InspectAsync(
        attachmentId,
        question,
        userScope,
        cancellationToken);
}

This keeps the same design rule from the earlier tool and RAG posts: expose controlled capabilities, not raw infrastructure.

The tool can choose the right path internally:

call a native multimodal model
extract text and call a text model
render selected pages and send images
call a document intelligence service
reject unsupported files
ask the user for a narrower page range

The agent asks for document inspection. The application decides how document inspection works.

Logging matters more with files

Text prompts are already worth logging. File inputs make logging more important.

For multimodal agent calls, log at least:

model and provider
prompt version
attachment ID
media type
file size
page count, if applicable
selected pages or images sent
preprocessing path used
provider request ID, if available
token usage or image/page usage
safety or content-filter result

Do not log raw confidential file contents by default. Log identifiers, derived metadata, and enough trace data to reproduce the path with proper access.
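
A log entry along these lines covers most of that list without storing the file itself. This is a sketch with assumed field names; `MultimodalCallLog` and `MultimodalCallLogFactory` are hypothetical, and the content hash is one possible way to tell later whether two calls saw the same bytes.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public sealed record MultimodalCallLog(
    string Provider,
    string Model,
    string PromptVersion,
    string AttachmentId,
    string MediaType,
    long FileSizeBytes,
    string ContentSha256,
    int? PageCount,
    IReadOnlyList<int> PagesSent,
    string PreprocessingPath,
    string? ProviderRequestId);

public static class MultimodalCallLogFactory
{
    // Records size and a SHA-256 hash of the file, never the bytes themselves.
    public static MultimodalCallLog Create(
        string provider,
        string model,
        string promptVersion,
        string attachmentId,
        string mediaType,
        byte[] fileBytes,
        int? pageCount,
        IReadOnlyList<int> pagesSent,
        string preprocessingPath,
        string? providerRequestId)
        => new(
            provider,
            model,
            promptVersion,
            attachmentId,
            mediaType,
            fileBytes.LongLength,
            Convert.ToHexString(SHA256.HashData(fileBytes)),
            pageCount,
            pagesSent,
            preprocessingPath,
            providerRequestId);
}
```

Token usage and content-filter results would be added after the call returns, from whatever the provider response exposes.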

Without this, debugging becomes guesswork. A report like “The agent got the PDF wrong.” is not actionable.

You need to know whether:

the wrong page was selected
OCR failed
the model never received the chart
the provider used text extraction only
the image was resized too aggressively
the prompt asked for an exact value the image could not support
the model guessed instead of saying “not visible”

Multimodal bugs often come from the pipeline around the model.

When I would use native multimodal input

I would use native image or PDF input when:

the user asks about one image or a small set of images
the document is short
visual layout is part of the question
the task is exploratory or assistive
exact extraction is not the only success criterion
provider lock-in is acceptable for that feature
latency and cost are acceptable after testing with real files

Examples:

Explain what is wrong in this deployment screenshot.
Compare these two UI screenshots.
Summarize the visible changes in this architecture diagram.
Identify which page in this short PDF contains the approval signature.
Extract draft fields from a single scanned form for human review.

Native multimodal input is strongest when seeing the artifact is the point.

When I would preprocess manually

I would preprocess manually when:

documents are long
documents are processed repeatedly
results need citations
extraction needs validation
fields need to become database records
users have different access to different files
the same corpus is queried many times
cost needs to be predictable
provider portability matters

Examples:

Search across a policy library.
Extract invoice fields into an ERP workflow.
Answer questions over thousands of manuals.
Review contracts with section-level citations.
Compare document versions.
Validate forms against business rules.

In those cases, use a pipeline:

document
-> extract text and metadata
-> OCR or render pages only when needed
-> chunk and index
-> retrieve relevant sections
-> send selected text/images to the agent
-> validate and store structured results

The agent still matters. It just should not own the entire document pipeline.

Conclusion

Multimodal agents are useful, but the abstraction boundary is easy to overestimate.

Agent Framework can pass images and other content through the agent message model. The provider and model decide what actually works. PDFs add another layer because they may be text documents, image documents, layout documents, or all three at once.

The design I would carry forward is:

use native multimodal input for small, visual, interactive tasks
preprocess files when you need repeatability, scale, search, validation, or access control
model provider capabilities explicitly
keep raw uploads behind application-owned boundaries
use tools for document operations instead of exposing storage directly
log what happened to the file, not only the final answer

That keeps the agent useful without turning it into a fragile file-processing backend.

In the next post, I will move from files to actions. Human-in-the-loop agents are about the moment where the agent should stop and ask before it does something risky or hard to undo. That includes approval flows, side effects, workflow state, and the part that is easy to miss: human review is a system boundary, not a nicer prompt.
