
Computer Vision in Web Agents: When Text Is Not Enough

Agent Checker | 27 February 2026 | 4 min read

Most web agents start by reading the DOM. They parse HTML, extract text, and build a structural map of the page. For many sites, that is enough. But a growing number of agents also take screenshots and send them to vision models, because some information only exists in pixels.

Why Text Parsing Falls Short

Consider a restaurant website that displays its menu as a stylised image. The DOM contains an <img> tag with an alt attribute that reads "our menu." The actual dishes, prices, and descriptions are baked into the image file. A text-only agent sees a page with almost no useful content. A vision-capable agent reads the menu directly from the screenshot.
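The restaurant example can be turned into a rough check for images that probably hide content from text-only agents. This is a minimal sketch using Python's standard `html.parser`. The function name `find_opaque_images` and the four-word threshold are assumptions made for illustration, not an established rule. The check also cannot tell a purely decorative image from one that carries real content.

```python
from html.parser import HTMLParser

# Assumption: alt text shorter than this many words is unlikely to
# describe dishes, prices, or other real content inside the image.
MIN_USEFUL_WORDS = 4


class ImageAltAudit(HTMLParser):
    """Collects <img> tags whose alt text is missing or very short."""

    def __init__(self):
        super().__init__()
        self.flagged = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # A missing or valueless alt attribute is treated as empty text.
        alt = (attributes.get("alt") or "").strip().lower()
        if len(alt.split()) < MIN_USEFUL_WORDS:
            self.flagged.append((attributes.get("src"), alt))


def find_opaque_images(html):
    """Return (src, alt) pairs for images a text-only agent cannot read."""
    audit = ImageAltAudit()
    audit.feed(html)
    return audit.flagged
```

Run against a page like the one described, the menu image with the alt text "our menu" would be flagged, while an image with a full descriptive sentence as its alt text would pass.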

This is not an edge case. Infographics, charts, annotated diagrams, product comparison tables rendered as images, promotional banners with embedded text: the web is full of information that lives outside the DOM. Image-heavy sites present a real challenge for text-only agents.

Custom UI widgets are another common problem. A date picker built with a canvas element, a drag-and-drop interface, or a map with clickable regions may have minimal DOM representation. Vision models can identify these components by their visual appearance and understand how to interact with them.

How Vision Models Process Web Pages

When a vision-capable agent encounters a page, it typically captures a viewport screenshot at a fixed resolution (commonly 1280x720 or 1920x1080). This image gets sent to a multi-modal language model alongside the agent's current task and any relevant context.

The model processes the image in a single pass. It identifies text, buttons, form fields, navigation elements, and their spatial relationships. It understands that a price displayed below a product image belongs to that product, not the one three cards over. It recognises standard UI patterns: hamburger menus, search bars, shopping carts, pagination controls.

Some frameworks like Browser Use annotate the screenshot before sending it. They overlay numbered labels on interactive elements, creating a bridge between what the model sees and what it can click. The model says "click element 14" and the automation layer maps that label back to a DOM node.
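The label-to-node mapping can be sketched in a few lines. This is an illustration of the general idea only, not Browser Use's actual API: `Element`, `assign_labels`, `resolve_action`, and the reply format are all hypothetical names and assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    selector: str  # how the automation layer locates the DOM node
    box: tuple     # (x, y, width, height) in screenshot pixels


def assign_labels(elements):
    """Number interactive elements from 1; these numbers are what gets
    drawn on top of the screenshot before it is sent to the model."""
    return {label: element for label, element in enumerate(elements, start=1)}


def resolve_action(model_reply, labels):
    """Turn a reply such as 'click element 2' into (action, selector)."""
    match = re.search(r"(click|hover)\s+element\s+(\d+)", model_reply.lower())
    if match is None:
        raise ValueError(f"no labelled action in reply: {model_reply!r}")
    action, label = match.group(1), int(match.group(2))
    if label not in labels:
        raise KeyError(f"model referenced unknown label {label}")
    return action, labels[label].selector


labels = assign_labels([
    Element("#search", (10, 10, 200, 30)),
    Element("button.add-to-cart", (300, 400, 120, 40)),
])
action, selector = resolve_action("Click element 2", labels)
```

The model never needs to know a selector. It only refers to the number it can see, and the automation layer owns the translation back to the DOM.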

The Cost of Seeing

Vision is expensive. Sending a screenshot to a language model usually costs several times more tokens than sending a text summary of the same page. As a rough illustration, a single 1080p screenshot might cost 1,000 or more tokens, while a text summary of the same page might fit in 200. Exact figures depend on the model and how it tokenises images, but over hundreds of pages the difference adds up quickly.
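To see how the gap compounds, here is the arithmetic as a small sketch. The per-page defaults are the illustrative figures from the paragraph above, not measured values, and `token_overhead` is a hypothetical helper name.

```python
def token_overhead(pages, vision_tokens_per_page=1000, text_tokens_per_page=200):
    """Compare total token use for vision-only and text-only browsing."""
    vision_total = pages * vision_tokens_per_page
    text_total = pages * text_tokens_per_page
    return {
        "vision": vision_total,
        "text": text_total,
        "extra": vision_total - text_total,
        "ratio": vision_total / text_total,
    }


summary = token_overhead(pages=500)
# With these assumed figures, 500 pages cost 500,000 tokens with vision
# versus 100,000 with text: 400,000 extra tokens, a 5x difference.
```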

Latency is the other cost. Processing an image takes longer than processing text. An agent that relies on vision for every page interaction will be noticeably slower than one using DOM parsing alone. Most production agents use vision selectively: they parse the DOM first, and only fall back to vision when the text representation is insufficient.
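A DOM-first policy with a vision fallback might look something like the sketch below. The function name, its inputs, and the 200-character threshold are assumptions chosen for illustration. A production agent would likely weigh more signals than these.

```python
def choose_perception(dom_text, image_count, canvas_count, min_chars=200):
    """Return 'dom' when extracted text looks sufficient, else 'vision'.

    Assumption: a page with very little extractable text but with images
    or canvas elements probably keeps its content in pixels.
    """
    # Collapse whitespace so indentation and blank lines do not inflate the count.
    visible_chars = len(" ".join(dom_text.split()))
    has_pixel_content = (image_count + canvas_count) > 0
    if visible_chars < min_chars and has_pixel_content:
        return "vision"
    return "dom"


# The image-only menu page from earlier: almost no text, one image.
menu_page = choose_perception("Our menu", image_count=1, canvas_count=0)

# A text-rich article with a few illustrations stays on the cheap path.
article_page = choose_perception("word " * 100, image_count=3, canvas_count=0)
```

Keeping the check cheap matters here: it runs on every page, and its whole purpose is to avoid paying for a screenshot when the DOM already has the answer.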

Resolution matters too. Fine print, small icons, and densely packed data tables can be misread at standard screenshot resolutions. Agents occasionally crop and zoom specific page regions for a closer look, but this adds more API calls and more cost.
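The crop-and-zoom step is mostly coordinate arithmetic. The sketch below computes a padded crop box clamped to the screenshot and a capped enlargement factor. The helper names, the 20-pixel padding, and the 4x zoom cap are assumptions. The `(left, top, right, bottom)` tuple follows the box convention used by common imaging libraries such as Pillow.

```python
def crop_region(box, screenshot_size, padding=20):
    """Expand an element box by `padding` and clamp it to the screenshot.

    `box` is (x, y, width, height); the result is (left, top, right, bottom).
    """
    x, y, w, h = box
    width, height = screenshot_size
    left = max(0, x - padding)
    top = max(0, y - padding)
    right = min(width, x + w + padding)
    bottom = min(height, y + h + padding)
    return left, top, right, bottom


def zoom_factor(crop, target_width=1280, max_zoom=4.0):
    """How much to enlarge the crop so it fills `target_width`, capped
    because upscaling far beyond the source resolution adds no detail."""
    left, _, right, _ = crop
    if right <= left:
        raise ValueError("crop has no width")
    return min(max_zoom, target_width / (right - left))


# Fine print near the right edge of a 1280x720 screenshot.
crop = crop_region((1200, 10, 100, 50), (1280, 720))
zoom = zoom_factor(crop)
```

Each crop sent this way is another image in another request, which is why agents reserve it for regions they could not read the first time.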

When Agents Choose Vision Over Text

Agents do not reach for vision at random. The decision is usually a practical one, and it tends to come up in specific situations:

Verification. After performing an action (clicking a button, submitting a form), the agent takes a screenshot to visually confirm the expected result occurred. Did the item actually get added to the cart? Did the error message appear?

Ambiguous layouts. When the DOM structure is unclear, perhaps because of deeply nested divs with no semantic meaning, a screenshot gives the agent spatial context that the HTML cannot provide. Understanding how LLMs interpret page layouts helps explain why some DOMs are harder to reason about than others.

Image-embedded content. Any text or data baked into images, canvas elements, or SVGs that do not have accessible text alternatives.

Visual-only state. Loading spinners, progress bars, colour-coded status indicators, and greyed-out disabled states that exist in CSS but not always in HTML attributes.

What This Means for Your Site

If your site communicates important information only visually, vision-capable agents can still read it, but text-only agents and screen readers cannot. The best strategy is redundancy: communicate information both visually and structurally.

Add alt text to images that contain meaningful content. Use semantic HTML so the DOM tells the same story the visual layout does. Provide text alternatives for chart data. Make sure form states (disabled, error, loading) are expressed in HTML attributes, not just in CSS.
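One of these checks, form states expressed only in CSS, can be roughly automated. The sketch below flags form controls that carry a state-like class name but no matching HTML or ARIA attribute. The class names in `STATE_CLASSES` are hypothetical examples, so substitute whatever your own stylesheet uses. The check only looks for the presence of an attribute, not its value.

```python
from html.parser import HTMLParser

FORM_TAGS = {"button", "input", "select", "textarea"}
# Hypothetical class names; adjust to match your own CSS conventions.
STATE_CLASSES = {"disabled", "is-disabled", "error", "is-error", "loading", "is-loading"}
# Attributes that expose the same states to agents and assistive technology.
STATE_ATTRS = {"disabled", "aria-disabled", "aria-invalid", "aria-busy"}


class StateAudit(HTMLParser):
    """Finds form controls whose state is visible in CSS classes only."""

    def __init__(self):
        super().__init__()
        self.css_only = []

    def handle_starttag(self, tag, attrs):
        if tag not in FORM_TAGS:
            return
        attributes = dict(attrs)
        classes = set((attributes.get("class") or "").split())
        state_classes = classes & STATE_CLASSES
        # Flag the control if it looks stateful but exposes no state attribute.
        if state_classes and STATE_ATTRS.isdisjoint(attributes):
            self.css_only.append((tag, sorted(state_classes)))


def find_css_only_states(html):
    audit = StateAudit()
    audit.feed(html)
    return audit.css_only
```

A button written as `<button class="btn is-disabled">` would be flagged, while `<button class="btn is-disabled" disabled>` would not, because the second version tells the same story in the DOM that it tells on screen.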

Sites that do this well work for every type of agent and every type of user. You can check how agents interact with your site to see whether your visual content is accessible to both text-based and vision-based agents. The gap between what a sighted human sees and what an agent can extract from your page should be as small as possible.