Browser-Using Agents: How They Work and What They Expect
Browser-using agents are programs that control a web browser to complete tasks on behalf of a user. They click buttons, fill in forms, read page content, and make decisions about what to do next. Unlike traditional web scrapers that parse raw HTML, these agents interact with fully rendered pages, complete with JavaScript execution, dynamic content, and visual layout.
The Basic Loop
Every browser-using agent follows roughly the same cycle: observe, decide, act.
First, the agent takes a snapshot of the current page. This might be a screenshot, the DOM tree, the accessibility tree, or some combination of all three. That snapshot gets fed into a language model along with the agent's goal, something like "book a flight from London to Berlin on March 20th."
The model then decides what action to take. It might output something like "click the element with text 'Departure City'" or "type 'London' into the search field." The agent translates that decision into a browser automation command and executes it.
After the action completes, the agent observes the page again. New content may have loaded, a modal might have appeared, or the page might have changed entirely. The cycle repeats until the task is done or the agent gets stuck.
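The cycle can be sketched in a few lines of Python. This is a minimal illustration, not how any particular framework implements it: FakeBrowser, Action, run_agent, and scripted_policy are all hypothetical names, the "browser" only records actions, and the "model" is a hard-coded function standing in for a language model call.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    target: str = ""   # human-readable description of the element
    text: str = ""     # text to type, if any


@dataclass
class FakeBrowser:
    """Stand-in for a real browser session; it only records actions."""
    page: str = "search form"
    log: list = field(default_factory=list)

    def snapshot(self) -> str:
        # A real agent would return a screenshot, DOM, or accessibility tree.
        return self.page

    def execute(self, action: Action) -> None:
        self.log.append(action)
        if action.kind == "click" and action.target == "Search":
            self.page = "results"


def run_agent(browser, decide, goal, max_steps=10) -> bool:
    """Observe, decide, act; stop when the policy says done or steps run out."""
    for _ in range(max_steps):
        observation = browser.snapshot()       # observe
        action = decide(goal, observation)     # decide
        if action.kind == "done":
            return True
        browser.execute(action)                # act
    return False                               # the agent got stuck


def scripted_policy(goal, observation):
    """Hard-coded stand-in for the language model."""
    if observation == "results":
        return Action("done")
    return Action("click", target="Search")
```

The max_steps cap is the important design choice here: without it, an agent that never reaches its goal would loop forever instead of reporting that it is stuck.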
What Sits Underneath
Most browser agents run on top of Playwright or Puppeteer. These are the same browser automation frameworks that testing teams use for end-to-end tests, and they give agents fine-grained control over a real browser instance. Playwright drives Chromium, Firefox, and WebKit, while Puppeteer focuses on Chrome and Firefox.
Playwright has become the more popular choice for newer agent frameworks. Browser Use, one of the most widely adopted open-source agent libraries, was originally built on Playwright. So was Stagehand from Browserbase. The reason is practical: Playwright handles multiple browser contexts well, has solid support for waiting on network requests, and deals with iframes more reliably than older alternatives.
The agent framework sits on top of these browser tools and adds the intelligence layer. It decides which elements to interact with, how to handle errors, and when to retry. Some frameworks, like Browser Use, send a simplified version of the DOM to the language model. Others send screenshots and rely on the model's vision capabilities to identify interactive elements.
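To make "a simplified version of the DOM" concrete, here is a rough sketch using only Python's standard html.parser module. It is an assumption about the general idea, not the format Browser Use or any other framework actually sends. The names InteractiveElementCollector and simplify are hypothetical, and the text collection is deliberately naive about nested elements that share a tag name.

```python
from html.parser import HTMLParser

INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractiveElementCollector(HTMLParser):
    """Keeps only elements a user could plausibly interact with."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._open = None  # element whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag in INTERACTIVE_TAGS or attr_map.get("role") == "button":
            entry = {"tag": tag, "attrs": attr_map, "text": ""}
            self.elements.append(entry)
            if tag != "input":  # <input> is a void element with no text
                self._open = entry

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open["tag"]:
            self._open = None

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"] += data


def simplify(html: str) -> str:
    """Return one numbered line per interactive element."""
    collector = InteractiveElementCollector()
    collector.feed(html)
    collector.close()
    lines = []
    for index, el in enumerate(collector.elements):
        label = (
            " ".join(el["text"].split())
            or el["attrs"].get("placeholder")
            or el["attrs"].get("aria-label")
            or ""
        )
        lines.append(f"[{index}] <{el['tag']}> {label}".rstrip())
    return "\n".join(lines)
```

Numbering the elements lets the model answer with something as short as "click 1", which the framework can map back to a concrete element.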
What Agents Expect from Your Website
Agents do not read your website the way a human does. They rely heavily on structured information. Here is what matters most:
Semantic HTML is the single biggest factor, and the way agents parse HTML structure explains why. An agent trying to find a login button will look for <button> elements, role="button" attributes, and text content like "Log in" or "Sign in." If your login action is a styled <div> with a click handler and no ARIA role, an agent working from the DOM will likely miss it entirely.
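A site owner can check for the styled <div> problem with a small audit. The sketch below is one possible approach, with hypothetical names (ClickableAudit, audit). It only catches inline onclick attributes; handlers attached from JavaScript with addEventListener do not appear in the markup, so a real audit would need to run in the browser.

```python
from html.parser import HTMLParser

NATIVE_CONTROLS = {"a", "button", "input", "select", "textarea"}


class ClickableAudit(HTMLParser):
    """Collects elements with an inline onclick handler but no semantic cue."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        is_clickable = "onclick" in attr_map
        has_semantics = tag in NATIVE_CONTROLS or bool(attr_map.get("role"))
        if is_clickable and not has_semantics:
            # Record the tag and its class so the element can be located.
            self.findings.append((tag, attr_map.get("class") or ""))


def audit(html: str) -> list:
    parser = ClickableAudit()
    parser.feed(html)
    parser.close()
    return parser.findings
```

Each finding is an element that a human can click but that gives an agent no structural hint that it is a control.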
Consistent, descriptive labels make a real difference. Form fields with proper <label> elements, buttons with clear text content, and links with descriptive anchor text all help agents understand what each element does. A button labelled "Submit" is fine. A button labelled "Go" next to three other buttons also labelled "Go" is a problem.
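The "Go" problem is easy to detect mechanically. As a small sketch, assuming the visible labels of a page's buttons have already been collected into a list, the hypothetical helper below reports every label that appears more than once after trimming whitespace and ignoring case.

```python
from collections import Counter


def ambiguous_labels(labels):
    """Return labels that occur more than once, normalised and sorted."""
    normalised = [" ".join(label.split()).casefold() for label in labels]
    counts = Counter(normalised)
    return sorted(label for label, n in counts.items() if n > 1)
```

Any label this returns is one an agent cannot resolve from text alone; it would need surrounding context, such as the form or section the button belongs to.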
Stable selectors matter for agents that revisit your site. If your CSS class names change on every build because of hashing, and your elements lack id attributes or data-testid markers, agents cannot reliably target the same elements across visits.
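One way an agent might pick a selector that survives rebuilds is to prefer attributes that developers set by hand and to refuse to fall back on class names at all. The sketch below shows that idea; the function name and the priority order are assumptions for illustration, and it does not escape quotes inside attribute values.

```python
# Attributes tried in order, most deliberate first (an assumed priority).
STABLE_ATTRIBUTES = ("data-testid", "id", "name", "aria-label")


def stable_selector(tag, attrs):
    """Build a CSS attribute selector, or return None if nothing stable exists."""
    for name in STABLE_ATTRIBUTES:
        value = attrs.get(name)
        if value:
            return f'{tag}[{name}="{value}"]'
    return None  # only build-dependent attributes such as class are available
```

Returning None instead of a class-based selector makes the failure visible: the agent knows it must re-identify the element by text or position on the next visit rather than trusting a selector that may have changed.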
Predictable page transitions help agents maintain context. If clicking "Add to Cart" triggers a full page reload to a different URL, that is straightforward. If it opens an overlay that injects new DOM nodes while keeping the same URL, the agent needs to detect that change through DOM observation rather than navigation events.
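The difference between the two cases can be expressed as a comparison of page states before and after an action. This is a simplified sketch: PageState and classify_transition are hypothetical names, and node_ids stands for whatever stable identifiers the agent assigns to DOM nodes. In a real browser, the DOM side of this is typically observed with the MutationObserver API rather than by diffing full snapshots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageState:
    url: str
    node_ids: frozenset  # identifiers of the nodes present in the snapshot


def classify_transition(before: PageState, after: PageState) -> str:
    """Label what an action did to the page."""
    if before.url != after.url:
        return "navigation"   # full page change, visible as a navigation event
    if before.node_ids != after.node_ids:
        return "dom-change"   # same URL, but nodes were added or removed
    return "no-change"        # the action may have failed or is still pending


def added_nodes(before: PageState, after: PageState) -> frozenset:
    """Nodes that exist only after the action, such as an injected overlay."""
    return after.node_ids - before.node_ids
```

The "no-change" result matters as much as the other two, since it is the signal that tells the agent to wait or retry.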
Where Agents Struggle
Authentication flows are consistently difficult. CAPTCHAs are designed to block automated access, and multi-step authentication with email codes or authenticator apps requires the agent to switch between different systems.
Highly dynamic interfaces also cause problems. Single-page applications that load content through infinite scroll, update prices via WebSocket, or rearrange layout based on viewport size can confuse agents that expect a static page state.
Custom web components with shadow DOMs are another common problem. The agent's DOM snapshot might not include the internal structure of shadow DOM elements, making them effectively invisible.
A Practical Example
Consider a simple task: "Find the cheapest flight from London to Paris next Friday."
The agent loads a flight comparison site, identifies the departure field (probably through its label or placeholder text), types "London," waits for an autocomplete dropdown, selects the right airport, then repeats the process for the destination. It locates the date picker, works out which element represents next Friday, clicks it, and hits search. On the results page, it reads through the options, compares prices, and returns the cheapest one.
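Even the "next Friday" step hides a small calculation. The sketch below assumes "next Friday" means the first Friday strictly after today, which is only one possible reading; some users mean the Friday of the following week. The function name is hypothetical.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() counts Monday as 0


def next_friday(today: date) -> date:
    """Return the first Friday strictly after the given date."""
    days_ahead = (FRIDAY - today.weekday() - 1) % 7 + 1  # always 1 to 7
    return today + timedelta(days=days_ahead)
```

With the target date resolved, the agent still has to find the matching cell in the date picker, which is where clear labels on each day make the difference.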
Each of those steps requires the agent to understand the page structure, handle dynamic UI elements, and recover from unexpected states. A well-structured site with clear labels and semantic markup makes every one of those steps easier. A poorly structured site turns a 30-second task into a minutes-long struggle that may never complete.
The gap between sites that work well with agents and sites that do not is growing. As more users delegate browsing tasks to AI, that gap will start to show up in engagement metrics.