LLMs · Agent Technology · Web Standards

How Large Language Models Interpret Web Page Layouts

Agent Checker · 5 min read

When a language model receives a web page, it does not see the page the way you do. There is no visual rendering in most cases. Instead, the model gets a text representation of the page, and the format of that representation shapes what the model can understand.

Three Ways to Represent a Page

Agent frameworks typically choose one of three approaches to represent a web page to a language model, and each has different strengths.

Raw HTML is the most literal representation. The model receives the actual markup, including tags, attributes, class names, and text content. This preserves all structural information but comes with a lot of noise. A typical web page contains thousands of lines of HTML, much of it navigation, ads, tracking scripts, and layout containers that are irrelevant to the task at hand.

The accessibility tree is a cleaned-up structural representation. Browsers build this tree from the DOM for assistive technologies such as screen readers, and it leaves out decorative elements, hidden content, and purely presentational containers. What remains is a hierarchy of meaningful elements: headings, links, buttons, form fields, and text content. This is often the best balance between information density and relevance.

Screenshots (covered separately in multi-modal agents) provide the visual perspective. They capture layout, colour, spacing, and visual hierarchy that text representations miss.

Most agent frameworks use a combination. Browser Use, for example, extracts the DOM, simplifies it by removing invisible and irrelevant elements, and can optionally include a screenshot.

How Models Parse HTML

Large language models have seen an enormous amount of HTML in their training data. This means they have strong intuitions about common patterns. A model knows that content inside <nav> is navigation, that <h1> is the main heading, and that <form> elements contain input fields.

But these intuitions break down with unusual or obfuscated markup. Consider a site that uses <div class="x7f2q"> for everything, with layout handled entirely in CSS. The model has very little to go on when distinguishing the header from the sidebar from the main content. Without semantic tags or readable class names, the HTML is just an undifferentiated tree of nested containers.

Models also struggle with extremely long HTML documents. Context windows have grown substantially, but a complex web page can easily contain 50,000 to 100,000 tokens of HTML. Even models with large context windows lose accuracy when the relevant information is buried in thousands of lines of irrelevant markup.

The Accessibility Tree Advantage

The accessibility tree is almost purpose-built for what agents need. It represents the page as a hierarchy like this:

heading "Flight Search" (level 1)
  text "Find the best flights"
group "Search Form"
  textbox "From" value=""
  textbox "To" value=""
  button "Search Flights"
navigation "Main Menu"
  link "Home"
  link "My Bookings"
  link "Support"

This is dramatically smaller than the raw HTML and captures exactly the information an agent needs: what elements exist, what they do, and what they contain.

The catch is that the accessibility tree is only as good as the page's accessibility implementation. If form fields lack labels, buttons have no text content, and headings are missing, the accessibility tree will be sparse and unhelpful, just as it would be for a screen reader user.
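To make the dependency on labelling concrete, here is a minimal sketch in Python that builds a flat, accessibility-style outline from HTML using only the standard library. It is a toy, not how browsers compute the real accessibility tree: the OutlineBuilder class and outline function are hypothetical names, it only handles a handful of tags, and it assumes each <label> appears before the input it refers to.

```python
from html.parser import HTMLParser


class OutlineBuilder(HTMLParser):
    """Toy outline builder; real accessibility trees are computed by the browser."""

    ROLES = {"a": "link", "button": "button", "label": "label",
             "h1": "heading", "h2": "heading", "h3": "heading"}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._labels = {}      # input id -> label text (label must come first)
        self._current = None   # (tag, attrs) of the element whose text we collect
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.ROLES:
            self._current, self._text = (tag, attrs), []
        elif tag == "input":
            # Name comes from aria-label, else from a matching <label for="...">.
            name = attrs.get("aria-label") or self._labels.get(attrs.get("id"), "")
            self.lines.append(f'textbox "{name}"')

    def handle_data(self, data):
        if self._current:
            self._text.append(data)

    def handle_endtag(self, tag):
        if not self._current or tag != self._current[0]:
            return
        attrs = self._current[1]
        text = " ".join("".join(self._text).split())
        name = attrs.get("aria-label") or text
        self._current = None
        if tag == "label":
            self._labels[attrs.get("for")] = name
        elif tag.startswith("h"):
            self.lines.append(f'heading "{name}" (level {tag[1]})')
        else:
            self.lines.append(f'{self.ROLES[tag]} "{name}"')


def outline(html):
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return builder.lines


labelled = ('<h1>Flight Search</h1><label for="from">From</label>'
            '<input id="from"><button>Search Flights</button>')
unlabelled = ('<div class="x7f2q">Flight Search</div>'
              '<input id="from"><button><svg></svg></button>')

print(outline(labelled))
print(outline(unlabelled))
```

The labelled snippet yields a named heading, textbox, and button. The unlabelled one yields a textbox and a button with empty names and no heading at all, which is roughly the situation an agent faces on a page with poor accessibility markup.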

How Models Understand Layout Without Seeing It

When a model receives text-only page data, it has to infer layout from structural cues. It does this through several signals:

Element ordering. The model assumes that elements appearing earlier in the document are higher on the page. This is generally true for well-structured pages, but absolute positioning, the flexbox order property, and CSS Grid placement can make the visual order completely different from the DOM order.

Heading hierarchy. An <h2> followed by several paragraphs, then another <h2>, tells the model these are two distinct sections. Proper heading levels create a clear content outline.

ARIA landmarks. role="main", role="navigation", role="complementary" (typically a sidebar), and similar landmarks let the model understand page regions without seeing them. These are exactly the attributes that the WAI-ARIA specification defines for accessibility, and as covered in "ARIA labels beyond screen readers", they serve double duty for AI agents.

Table structure. Models are reasonably good at understanding data in <table> elements with proper <th> headers. They struggle more with data presented as CSS grid layouts or visually aligned <div> elements that look like a table but have no tabular markup.
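Two of the signals above, heading hierarchy and ARIA landmarks, can be read straight from the markup. The sketch below is an illustration rather than any framework's actual method: StructureReader and read_structure are hypothetical names, only three landmark roles are recognised, and the markup is assumed to be well-formed.

```python
from html.parser import HTMLParser

# Simplified mapping of HTML5 elements to their implicit landmark roles.
IMPLICIT_ROLES = {"main": "main", "nav": "navigation", "aside": "complementary"}
LANDMARKS = set(IMPLICIT_ROLES.values())
HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class StructureReader(HTMLParser):
    def __init__(self):
        super().__init__()
        self.headings = []    # (level, text, enclosing landmark role or None)
        self._regions = []    # open landmarks: [tag, role, nested same-tag depth]
        self._heading = None  # level of the heading currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        role = dict(attrs).get("role") or IMPLICIT_ROLES.get(tag)
        if role in LANDMARKS:
            self._regions.append([tag, role, 0])
        elif self._regions and tag == self._regions[-1][0]:
            # Same tag nested inside the landmark, e.g. a <div> in <div role="main">.
            self._regions[-1][2] += 1
        if tag in HEADINGS:
            self._heading, self._text = int(tag[1]), []

    def handle_data(self, data):
        if self._heading is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._heading is not None:
            region = self._regions[-1][1] if self._regions else None
            text = " ".join("".join(self._text).split())
            self.headings.append((self._heading, text, region))
            self._heading = None
        elif self._regions and tag == self._regions[-1][0]:
            if self._regions[-1][2] > 0:
                self._regions[-1][2] -= 1
            else:
                self._regions.pop()


def read_structure(html):
    reader = StructureReader()
    reader.feed(html)
    reader.close()
    return reader.headings


page = ('<nav><h2>Menu</h2></nav>'
        '<div role="main"><h1>Flights</h1><div><h2>Results</h2></div>'
        '<h2>Filters</h2></div><h2>Footer</h2>')

for level, text, region in read_structure(page):
    print("  " * (level - 1) + f"{text} [{region or 'no landmark'}]")
```

Each heading comes back with its level and the landmark it sits in, which is enough to reconstruct a rough page outline without rendering anything. A page built only from unlabelled <div> elements would return an empty list.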

What Makes a Page Easy or Hard to Interpret

Based on how models actually process page content, certain patterns consistently produce better results.

Easy pages have clear heading structure, semantic HTML that agents can parse, descriptive link text (not "click here"), labelled form fields, and a logical DOM order that matches the visual order. They tend to be information-dense rather than layout-heavy.

Hard pages rely on visual layout for meaning, use generic <div> and <span> elements extensively, have deeply nested markup, put important content inside iframes, or split information across multiple JavaScript-rendered components with no fallback structure.

Sites with lots of repeated structure, like search results or product listings, are surprisingly difficult. The model sees twenty similar blocks and may struggle to identify which one matches the user's criteria, especially if the differences are subtle.

Token Economy

There is a practical cost dimension. Every token of page content sent to a model costs money and adds latency. A typical product page might be 3,000 tokens as an accessibility tree, 15,000 tokens as simplified HTML, or 60,000 tokens as raw HTML.
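The arithmetic is easy to run for yourself. The sketch below uses the illustrative token counts from the paragraph above and a hypothetical price of 3.00 dollars per million input tokens; that price is an assumption for the example only, so substitute your provider's actual rate.

```python
# Token counts are the illustrative figures from the paragraph above.
REPRESENTATION_TOKENS = {
    "accessibility tree": 3_000,
    "simplified HTML": 15_000,
    "raw HTML": 60_000,
}

# Hypothetical price in dollars; replace with your provider's real input rate.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00


def cost_per_page(tokens, price_per_million=PRICE_PER_MILLION_INPUT_TOKENS):
    """Input cost in dollars for sending one page representation once."""
    return tokens / 1_000_000 * price_per_million


def cost_table(page_views):
    """Total input cost per representation over a number of page views."""
    return {name: cost_per_page(tokens) * page_views
            for name, tokens in REPRESENTATION_TOKENS.items()}


for name, dollars in cost_table(page_views=1_000).items():
    print(f"{name:>20}: ${dollars:,.2f} per 1,000 page views")
```

Whatever the real price is, the ratio holds: with these token counts, raw HTML costs 20 times as much as the accessibility tree for the same page. An agent that reads several pages per task multiplies that difference again.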

Agent frameworks that minimise the representation size while keeping relevant information intact perform better on both cost and accuracy. This is why most successful frameworks aggressively prune the DOM before sending it to the model, removing hidden elements, script tags, style blocks, and decorative markup.

This pruning works best when the page has clear semantic structure. When everything is a <div>, it is hard to know what is safe to remove. When elements use proper semantic tags and ARIA roles, the framework can confidently strip out irrelevant sections and keep only what matters.
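A rough sketch of that kind of pruning, using only the Python standard library, might look like the following. It is not the algorithm of any particular framework: Pruner and prune are hypothetical names, the set of dropped tags and the hidden-element checks are deliberately small, and the depth counting assumes well-formed markup with explicit closing tags.

```python
from html import escape
from html.parser import HTMLParser

DROP_TAGS = {"script", "style", "noscript", "svg"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def is_hidden(attrs):
    """Very rough hidden check: hidden attribute, aria-hidden, inline display:none."""
    style = (attrs.get("style") or "").replace(" ", "").lower()
    return ("hidden" in attrs or attrs.get("aria-hidden") == "true"
            or "display:none" in style)


class Pruner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self._skip_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        void = tag in VOID_TAGS
        if self._skip_depth:
            if not void:  # void elements never get a closing tag
                self._skip_depth += 1
            return
        if tag in DROP_TAGS or is_hidden(dict(attrs)):
            if not void:
                self._skip_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            # The parser decodes character references, so re-escape text on output.
            self.out.append(escape(data, quote=False))

    # Comments and doctype declarations are dropped by the default handlers.


def prune(html):
    pruner = Pruner()
    pruner.feed(html)
    pruner.close()
    return "".join(pruner.out)


page = ('<main><h1>Flights &amp; Hotels</h1><script>track("page")</script>'
        '<div style="display: none"><p>Old <img src="a.png"> promo</p></div>'
        '<button>Search</button></main>')

print(prune(page))
```

Notice what the sketch can and cannot decide. Script blocks and explicitly hidden elements are safe to drop, but nothing here can tell whether an anonymous <div> is a cookie banner or the main content. That decision only becomes mechanical when the page provides semantic tags and roles to key on.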

Your page structure directly affects how much it costs for an AI agent to understand your site. That might sound abstract today, but as agent traffic grows, it becomes a real factor in how often and how accurately agents can interact with your content.