Multi-Modal Agents: When AI Can See Your Website Like a Human
For years, bots interacted with websites through the DOM. They parsed HTML, followed links, and extracted text. They were blind to how a page actually looked. A red warning banner and a green success message were the same thing if they shared the same HTML structure.
That changed when language models gained vision capabilities.
How Multi-Modal Agents See Pages
A multi-modal agent takes a screenshot of the current browser viewport and sends it to a vision-capable language model alongside the task instructions. The model processes the image and identifies interactive elements, text content, layout structure, and visual cues.
This goes beyond optical character recognition. The model reads the text, but it also understands spatial relationships. It can tell that a "Buy Now" button is associated with a specific product because they are visually grouped together. It knows that a navigation menu runs across the top of the page because of how the elements are laid out, not because of their HTML nesting.
Some agent frameworks combine visual and DOM-based approaches. Browser Use, for instance, overlays numbered labels on interactive elements in the screenshot, then gives the model both the annotated image and a simplified DOM. The model can reference elements by their visual label number, which connects what it sees with what it can interact with programmatically.
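The labelling idea can be sketched in a few lines of Python. This is an illustration only, not Browser Use's actual code or API: the InteractiveElement record, the function names, and the sample selectors are all hypothetical, and a real framework would also draw the labels onto the screenshot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractiveElement:
    """Hypothetical record for one clickable or typeable element."""
    selector: str                    # how the automation layer targets it
    tag: str
    text: str
    box: tuple[int, int, int, int]   # x, y, width, height in viewport pixels


def assign_labels(elements):
    """Number elements in reading order: top to bottom, then left to right."""
    ordered = sorted(elements, key=lambda e: (e.box[1], e.box[0]))
    return {label: element for label, element in enumerate(ordered, start=1)}


def simplified_dom(labelled):
    """Build the text listing sent alongside the annotated screenshot."""
    return "\n".join(
        f"[{label}] <{e.tag}> {e.text}" for label, e in labelled.items()
    )


def resolve(labelled, label):
    """Turn a label number chosen by the model back into a selector."""
    return labelled[label].selector


elements = [
    InteractiveElement("#buy", "button", "Buy Now", (600, 400, 120, 40)),
    InteractiveElement("nav a.home", "a", "Home", (20, 10, 60, 20)),
    InteractiveElement("#search", "input", "Search", (300, 10, 200, 24)),
]
labelled = assign_labels(elements)
print(simplified_dom(labelled))
# [1] <a> Home
# [2] <input> Search
# [3] <button> Buy Now
print(resolve(labelled, 3))  # #buy
```

If the model answers "click 3", the label map is what turns that number back into something the automation layer can act on.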
Why Vision Matters
DOM-only approaches miss important information. Consider these common scenarios:
Visual grouping. A product card on an e-commerce site might contain a title, price, rating, and "Add to Cart" button. In the DOM, these might be sibling <div> elements with no semantic relationship. Visually, they are clearly grouped as a single unit. A vision model picks up on this grouping immediately.
Status indicators. A coloured dot next to a username (green for online, grey for offline) conveys information that exists only in CSS or inline styles. A DOM-only agent would need specific logic to interpret background-color values. A vision model recognises the pattern because it has seen it on thousands of websites.
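To show what that "specific logic" might look like, here is a minimal sketch that guesses a status from an inline style. The colour thresholds and the green/grey convention are assumptions made for illustration, and the function names are hypothetical. It handles only named colours and hex values, which is part of the point: every extra colour format or site convention needs more code.

```python
import re

# A few named colours; a real implementation would need the full CSS list.
NAMED = {
    "green": (0, 128, 0),
    "grey": (128, 128, 128),
    "gray": (128, 128, 128),
    "red": (255, 0, 0),
}


def background_color(style):
    """Extract the background-color value from an inline style string."""
    match = re.search(r"background-color\s*:\s*([^;]+)", style, re.I)
    return match.group(1).strip().lower() if match else None


def to_rgb(value):
    """Convert a named colour or a #rgb / #rrggbb value to an RGB tuple."""
    if value in NAMED:
        return NAMED[value]
    match = re.fullmatch(r"#([0-9a-f]{3}|[0-9a-f]{6})", value)
    if not match:
        return None  # rgb(), hsl(), CSS variables and so on are not handled
    digits = match.group(1)
    if len(digits) == 3:
        digits = "".join(ch * 2 for ch in digits)
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def status_from_style(style):
    """Guess online/offline from a status dot's inline style (heuristic)."""
    value = background_color(style)
    rgb = to_rgb(value) if value else None
    if rgb is None:
        return "unknown"
    r, g, b = rgb
    if max(rgb) - min(rgb) < 30:
        return "offline"   # channels nearly equal, so roughly grey
    if g > r and g > b:
        return "online"    # green is the dominant channel
    return "unknown"


print(status_from_style("background-color: #2ecc71; border-radius: 50%"))  # online
print(status_from_style("background-color: grey"))                         # offline
print(status_from_style("background-color: rgb(0, 200, 0)"))               # unknown
```

The last call fails on a perfectly ordinary green dot simply because the colour was written in a different notation, and none of this helps when the colour comes from a stylesheet rather than an inline style.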
Layout context. When a page has a main content area and a sidebar, a vision model understands that the sidebar contains secondary information. It can prioritise reading the main content first. A DOM-only approach would need to infer this from class names or element ordering, both of which vary across sites (as explored in how LLMs interpret page layouts).
Disabled states. A greyed-out button looks disabled to a human and to a vision model. In the DOM, it might or might not have a disabled attribute, depending on how it was implemented.
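A short sketch makes the inconsistency concrete. It uses Python's standard html.parser to list the DOM signals that mark each button as disabled. The sample markup and the class-name check are assumptions chosen for illustration; the last button is styled to look greyed out but carries no structural signal at all.

```python
from html.parser import HTMLParser


class ButtonStateParser(HTMLParser):
    """Collect each <button> with the DOM signals that mark it disabled."""

    def __init__(self):
        super().__init__()
        self.buttons = []

    def handle_starttag(self, tag, attrs):
        if tag != "button":
            return
        attributes = dict(attrs)
        signals = []
        if "disabled" in attributes:                  # boolean HTML attribute
            signals.append("disabled attribute")
        if attributes.get("aria-disabled") == "true":
            signals.append("aria-disabled")
        classes = (attributes.get("class") or "").split()
        if any("disabled" in name for name in classes):
            signals.append("class name")              # convention, not a standard
        self.buttons.append((attributes.get("id"), signals))


html = """
<button id="a" disabled>Pay</button>
<button id="b" aria-disabled="true">Pay</button>
<button id="c" class="btn btn--disabled">Pay</button>
<button id="d" style="opacity: 0.4; pointer-events: none">Pay</button>
"""

parser = ButtonStateParser()
parser.feed(html)
for button_id, signals in parser.buttons:
    print(button_id, signals or "no DOM signal")
# a ['disabled attribute']
# b ['aria-disabled']
# c ['class name']
# d no DOM signal
```

All four buttons could look identical on screen. Only a check that covers every convention, or a look at the rendered page, treats them the same way.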
The Technical Pipeline
Here is how a typical multi-modal agent interaction works:
- The browser automation layer (usually Playwright) captures a screenshot of the current viewport.
- The screenshot is optionally annotated. Some frameworks add bounding boxes or labels to interactive elements.
- The image and any supplementary data (DOM snippet, accessibility tree, task context) are sent to the model.
- The model returns an action: click at specific coordinates, type text, scroll, or indicate that the task is complete.
- The automation layer executes the action and captures a new screenshot.
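The five steps above form a loop, which can be sketched as follows. Everything here is a stand-in: FakeBrowser replaces the automation layer, scripted_model replaces the call to a vision-capable model, and the Action fields are hypothetical. A real agent would return actual image bytes from the browser and parse the action from the model's response.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str        # "click", "type", "scroll", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


class FakeBrowser:
    """Stand-in for the automation layer; records what it was asked to do."""

    def __init__(self):
        self.log = []

    def screenshot(self):
        return b"fake-image-bytes"   # a real layer would return PNG bytes

    def execute(self, action):
        self.log.append(action)


def run_agent(browser, choose_action, task, max_steps=10):
    """Screenshot, ask the model, act, repeat until done or out of steps."""
    for step in range(max_steps):
        image = browser.screenshot()
        action = choose_action(image, task, step)
        if action.kind == "done":
            return True
        browser.execute(action)
    return False   # step budget exhausted before the model reported success


def scripted_model(image, task, step):
    """Stand-in for the model: replays a fixed list of actions."""
    script = [
        Action("click", x=640, y=420),
        Action("type", text="hello"),
        Action("done"),
    ]
    return script[step]


browser = FakeBrowser()
finished = run_agent(browser, scripted_model, "Submit the form")
print(finished, [a.kind for a in browser.log])  # True ['click', 'type']
```

The max_steps limit is a design choice worth keeping in any real version, since a model that never reports completion would otherwise loop indefinitely.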
Coordinate-based clicking is common in vision-heavy agents. Instead of identifying a DOM element and clicking it programmatically, the model outputs pixel coordinates and the automation layer clicks at that position. This works even when elements are difficult to target through the DOM, such as items inside canvas elements or complex SVG graphics.
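One practical detail follows from this: the coordinates the model returns refer to the image it was given. If the screenshot was resized before being sent, which is an assumption about the pipeline rather than something every framework does, the point has to be scaled back to viewport pixels before clicking. A minimal sketch, with a hypothetical helper name:

```python
def to_viewport(point, image_size, viewport_size):
    """Map a point in screenshot pixels to pixels in the browser viewport."""
    x, y = point
    image_w, image_h = image_size
    view_w, view_h = viewport_size
    if not (0 <= x < image_w and 0 <= y < image_h):
        raise ValueError(f"point {point} is outside the {image_size} screenshot")
    return (round(x * view_w / image_w), round(y * view_h / image_h))


# The model saw a 640x360 image of a 1280x720 viewport.
print(to_viewport((320, 210), (640, 360), (1280, 720)))  # (640, 420)
```

Rejecting out-of-range points guards against a model that returns coordinates outside the image it was shown.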
What Your Site's Visual Design Signals to Agents
Visual design choices that seem purely aesthetic actually carry meaning that multi-modal agents pick up on.
Contrast and hierarchy tell agents what is important. Large, high-contrast headings are recognised as titles. Smaller, lighter text is treated as secondary information. If your primary call-to-action button looks the same as every other element on the page, an agent may not prioritise it.
Whitespace and separation communicate structure. Elements with clear visual boundaries between them are understood as distinct sections. A wall of text with no visual breaks is harder for both humans and agents to parse.
Consistent patterns help agents generalise. If every product card on your site follows the same visual layout (image on top, title below, price below that), an agent can quickly learn the pattern and extract information from any card. Inconsistent layouts force the agent to re-analyse each card independently.
Icon usage can help or confuse. A shopping cart icon next to a number is widely understood to indicate items in a basket. But custom or ambiguous icons without text labels may confuse a vision model just as they confuse human users encountering them for the first time.
The Limits of Vision
Vision-based approaches are slower and more expensive than DOM-only methods. Sending a screenshot to a model typically consumes more tokens and more processing time than sending a text representation of the page.
Accuracy drops with complex or cluttered interfaces. Dense data tables, overlapping elements, and small text can be misread. Image-heavy sites pose particular challenges. Pages with many similar-looking elements (think a grid of twenty product cards) make it harder for the model to identify the right one.
Viewport size matters too. An agent only sees what is currently on screen. Content below the fold requires scrolling, and the agent needs to decide when and how far to scroll. A long page might need multiple screenshots to fully understand.
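The arithmetic behind "multiple screenshots" is simple. The sketch below computes the scroll offsets needed to cover a page of known height. The overlap between captures is an assumed design choice so that an element cut off at the bottom of one screenshot appears whole in the next, and the function name is hypothetical.

```python
import math


def scroll_positions(page_height, viewport_height, overlap=100):
    """Scroll offsets (in pixels) that cover the whole page with overlap."""
    if viewport_height <= overlap:
        raise ValueError("viewport must be taller than the overlap")
    if page_height <= viewport_height:
        return [0]                       # everything fits in one screenshot
    step = viewport_height - overlap     # new content revealed per scroll
    last = page_height - viewport_height # furthest possible scroll offset
    count = math.ceil(last / step)
    return [min(i * step, last) for i in range(count + 1)]


print(scroll_positions(3000, 800))  # [0, 700, 1400, 2100, 2200]
```

A 3000-pixel page in an 800-pixel viewport needs five screenshots here, and therefore five model calls, which is where the cost of long pages comes from. In practice an agent usually decides step by step whether to keep scrolling rather than capturing the whole page up front.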
Practical Implications
If your website relies on visual cues to communicate information, those cues now matter for machines as well as humans. A warning message styled in red but with no role="alert" attribute reads as a warning to a multi-modal agent, while a DOM-only agent sees only ordinary text.
The best approach is redundancy. Convey information both visually and structurally. Use colour and layout for humans and visual agents. Use semantic HTML, ARIA attributes, and text labels for DOM-based agents and screen readers. When both channels carry the same information, every type of user, human or artificial, can understand your site.
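One way to check for missing redundancy is a small audit script. The sketch below flags elements that look like warnings but carry no ARIA role. It assumes the visual styling is applied through class names containing "warning", "error", or "alert", which is a hypothetical convention; a real audit would need to match your own stylesheet.

```python
from html.parser import HTMLParser

VISUAL_HINTS = ("warning", "error", "alert")   # assumed class-name fragments
STRUCTURAL_ROLES = {"alert", "status"}         # ARIA roles that announce messages


class RedundancyAudit(HTMLParser):
    """Find elements styled as warnings that lack a matching ARIA role."""

    def __init__(self):
        super().__init__()
        self.missing_role = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").lower()
        looks_like_warning = any(hint in classes for hint in VISUAL_HINTS)
        has_role = (attributes.get("role") or "") in STRUCTURAL_ROLES
        if looks_like_warning and not has_role:
            self.missing_role.append((tag, attributes.get("class")))


page = """
<div class="banner banner-warning">Your card expires soon.</div>
<div class="banner banner-error" role="alert">Payment failed.</div>
<p class="intro">Welcome back.</p>
"""

audit = RedundancyAudit()
audit.feed(page)
print(audit.missing_role)  # [('div', 'banner banner-warning')]
```

The first banner is the case to fix: it is red on screen, but nothing in the markup tells a DOM-based agent or a screen reader that it is a warning.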