The role of sitemaps and robots.txt in the age of AI agents
Sitemaps and robots.txt have been around for decades. They were built for search engine crawlers, but AI agents are now using them too, and they interpret them a bit differently than Googlebot does. Combined with Schema.org markup, they form the foundation of how agents navigate your site.
How AI agents use sitemaps
A search crawler uses your sitemap to discover pages it might not find through links. An AI agent uses your sitemap for something more immediate: understanding what your site contains and how it is organised.
When an agent is asked to "find the returns policy on example.com", it can do one of two things. It can click through your navigation, following links until it finds the right page. Or it can fetch your sitemap, scan the URLs, and go directly to /returns-policy. The second approach is faster and more reliable.
A sitemap that works well for agents looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-02-28</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/products</loc>
    <lastmod>2026-02-27</lastmod>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/returns-policy</loc>
    <lastmod>2026-01-15</lastmod>
    <priority>0.5</priority>
  </url>
</urlset>
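The <lastmod> date tells agents how fresh the content is. The <priority> value (0.0 to 1.0, relative to the other pages on your own site) can help agents decide which pages matter most when they need to choose between several options. It is only a hint, and consumers are free to ignore it.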
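To make the "fetch the sitemap and go directly to the page" approach concrete, here is a minimal Python sketch of how an agent might do it. The function names (`parse_sitemap`, `find_page`) and the keyword-matching heuristic are illustrative assumptions, not a description of how any particular agent works. The sample XML is inlined so the sketch runs without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2026-02-28</lastmod><priority>1.0</priority></url>
  <url><loc>https://example.com/products</loc><lastmod>2026-02-27</lastmod><priority>0.8</priority></url>
  <url><loc>https://example.com/returns-policy</loc><lastmod>2026-01-15</lastmod><priority>0.5</priority></url>
</urlset>"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry with loc, lastmod and priority."""
    root = ET.fromstring(xml_text.strip())
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=NS),
            # The protocol's default priority is 0.5 when the tag is absent.
            "priority": float(url.findtext("sm:priority", default="0.5", namespaces=NS)),
        })
    return entries


def find_page(entries, keywords):
    """Pick the URL whose path contains the most keywords (hypothetical heuristic).

    Ties are broken by <priority>. Returns None when nothing matches.
    """
    def score(entry):
        path = urlparse(entry["loc"]).path.lower()
        hits = sum(1 for kw in keywords if kw.lower() in path)
        return (hits, entry["priority"])

    best = max(entries, key=score, default=None)
    if best is None or score(best)[0] == 0:
        return None
    return best


entries = parse_sitemap(SITEMAP_XML)
match = find_page(entries, ["returns", "policy"])
print(match["loc"] if match else "no match")
```

A lookup like this only works when the URL paths are descriptive. With paths such as /page?id=47 the keyword match has nothing to go on.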
Making sitemaps more useful
Use descriptive URL paths. An agent can infer meaning from URLs. /returns-policy is clear. /page?id=47 tells it nothing.
Organise with sitemap indexes. For larger sites, use a sitemap index that groups pages by type:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemaps/products.xml</loc>
    <lastmod>2026-02-28</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemaps/blog.xml</loc>
    <lastmod>2026-02-25</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemaps/support.xml</loc>
    <lastmod>2026-02-20</lastmod>
  </sitemap>
</sitemapindex>
An agent looking for product information can go straight to the products sitemap. One looking for help articles can check the support sitemap. This saves time and reduces unnecessary page loads.
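As a rough illustration of that shortcut, the Python sketch below reads a sitemap index and maps each child sitemap to a topic based on its file name. Deriving the topic from the file name is an assumption made for this example (the protocol attaches no meaning to sitemap names), and `child_sitemaps` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath
from urllib.parse import urlparse

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

INDEX_XML = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemaps/products.xml</loc><lastmod>2026-02-28</lastmod></sitemap>
  <sitemap><loc>https://example.com/sitemaps/blog.xml</loc><lastmod>2026-02-25</lastmod></sitemap>
  <sitemap><loc>https://example.com/sitemaps/support.xml</loc><lastmod>2026-02-20</lastmod></sitemap>
</sitemapindex>"""


def child_sitemaps(index_xml):
    """Map a topic name (the file stem, e.g. 'products') to each child sitemap URL."""
    root = ET.fromstring(index_xml.strip())
    children = {}
    for sitemap in root.findall("sm:sitemap", NS):
        loc = sitemap.findtext("sm:loc", default="", namespaces=NS).strip()
        topic = PurePosixPath(urlparse(loc).path).stem
        children[topic] = loc
    return children


children = child_sitemaps(INDEX_XML)
# An agent looking for help articles would fetch only this child sitemap.
print(children.get("support"))
```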
Keep your sitemap current. A stale sitemap with dead links wastes agent time and may cause the agent to report that a page does not exist when it has simply moved.
Robots.txt and agent access
Robots.txt controls which pages crawlers and agents are allowed to visit. The challenge is that AI agents use many different user-agent strings, and new ones appear regularly.
A typical robots.txt might look like this:
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /checkout/
Disallow: /account/

User-agent: GPTBot
Allow: /products/
Allow: /blog/
Disallow: /

User-agent: Amazonbot
Allow: /products/
Disallow: /

Sitemap: https://example.com/sitemap.xml
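The wildcard rule allows general access while blocking private areas. The specific rules for GPTBot and Amazonbot restrict those crawlers to certain sections. A crawler that matches a named group follows only that group and ignores the wildcard rules, so each named group has to be complete on its own.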
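To check how rules like these resolve for a given crawler, the Python sketch below applies the longest-match rule from RFC 9309: the most specific matching path wins, and Allow wins a tie. It is a simplified illustration with hypothetical helper names (`parse_groups`, `is_allowed`). It ignores the `*` and `$` path wildcards and matches user-agent tokens exactly. Python's built-in `urllib.robotparser` checks rules in file order rather than by longest match, so for the wildcard group above (where `Allow: /` comes first) it can give a different answer.

```python
ROBOTS_TXT = """\
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /checkout/
Disallow: /account/

User-agent: GPTBot
Allow: /products/
Allow: /blog/
Disallow: /

User-agent: Amazonbot
Allow: /products/
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def parse_groups(robots_txt):
    """Map each lowercase user-agent token to its list of (directive, path) rules."""
    groups = {}
    current_agents = []
    reading_agents = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:  # a new group starts here
                current_agents = []
                reading_agents = True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
        # Other fields (such as Sitemap) do not affect access rules.
    return groups


def is_allowed(groups, user_agent, path):
    """Longest matching rule wins; Allow wins ties; no matching rule means allowed."""
    rules = groups.get(user_agent.lower(), groups.get("*", []))
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and directive == "allow"
        if longer or tie_for_allow:
            best_len, allowed = len(prefix), directive == "allow"
    return allowed


groups = parse_groups(ROBOTS_TXT)
for agent, path in [("*", "/admin/users"), ("GPTBot", "/blog/post"),
                    ("GPTBot", "/checkout/"), ("Amazonbot", "/blog/post")]:
    print(agent, path, is_allowed(groups, agent, path))
```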
Agent-specific considerations
Do not block pages agents need to complete tasks. If users send AI agents to your site to make purchases, blocking /checkout/ for all bots means any agent that honours robots.txt cannot complete the transaction. You need to distinguish between crawlers (which index content) and user-directed agents (which act on behalf of a specific person).
This is where it gets tricky. Robots.txt was not designed for this distinction. Some options:
- Allow task-completion pages for all agents and use authentication to control access
- Create an allowlist of known user-agent strings for interactive agents
- Use a combination of robots.txt for crawlers and server-side logic for authenticated agents
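The third option can be sketched in Python as a small server-side decision function. Everything named here is hypothetical: the agent tokens in `INTERACTIVE_AGENTS`, the path lists, and the `decide` function itself are placeholders for illustration, not real agent identifiers or a recommended policy. It assumes the request has already been identified as automated traffic. Because user-agent strings are easy to spoof, the allowlist only selects the response for unauthenticated requests, and authentication is the actual gate.

```python
# Hypothetical tokens for user-directed agents; replace with ones you have verified.
INTERACTIVE_AGENTS = {"exampleshoppingagent", "exampleassistant"}

# Paths an agent needs in order to finish a task on a user's behalf.
TASK_PATHS = ("/checkout/", "/account/")

# Paths no automated client should reach.
PRIVATE_PATHS = ("/admin/",)


def is_interactive_agent(user_agent):
    """True when the user-agent string contains an allowlisted token."""
    ua = user_agent.lower()
    return any(token in ua for token in INTERACTIVE_AGENTS)


def decide(user_agent, path, authenticated):
    """Return 'allow', 'login_required' or 'deny' for an automated request."""
    if path.startswith(PRIVATE_PATHS):
        return "deny"
    if path.startswith(TASK_PATHS):
        if authenticated:
            return "allow"  # the session, not the user-agent string, grants access
        if is_interactive_agent(user_agent):
            return "login_required"  # invite the agent to authenticate
        return "deny"  # anonymous crawlers have no business here
    return "allow"  # public content


print(decide("ExampleShoppingAgent/1.0", "/checkout/cart", authenticated=True))
print(decide("ExampleShoppingAgent/1.0", "/checkout/cart", authenticated=False))
print(decide("GPTBot/1.0", "/checkout/cart", authenticated=False))
```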
Include the Sitemap directive. Always point to your sitemap from robots.txt. Well-behaved agents typically check robots.txt first, and the Sitemap line tells them where to find your full site structure.
Sitemap: https://example.com/sitemap.xml
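For a sense of how a client picks this line up, here is a short Python sketch using the standard library's `urllib.robotparser`, whose `site_maps()` method (Python 3.8+) returns the declared sitemap URLs. The `discover_sitemaps` wrapper and its fallback to /sitemap.xml are assumptions for this example. That fallback is only a common convention that some clients try.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""


def discover_sitemaps(robots_txt, base_url):
    """Return sitemap URLs declared in robots.txt, or a conventional guess."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.site_maps()  # None when no Sitemap line is present
    if declared:
        return declared
    # Assumed fallback: try the conventional location.
    return [base_url.rstrip("/") + "/sitemap.xml"]


print(discover_sitemaps(ROBOTS_TXT, "https://example.com"))
```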
The X-Robots-Tag header
For more granular control, use the X-Robots-Tag HTTP header. This works on non-HTML resources (PDFs, images) and gives per-page control without modifying robots.txt.
X-Robots-Tag: index, follow
X-Robots-Tag: GPTBot: noindex
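This tells general crawlers to index the page while asking GPTBot specifically not to index it. The user-agent prefix syntax is documented by Google for its own crawlers. Support in other crawlers varies, so check each vendor's documentation before relying on it.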
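The Python sketch below shows one way a crawler could interpret a pair of headers like these. The helper names (`directives_for`, `may_index`) are hypothetical, and the conflict rule is a conservative assumption made for this example: any applicable `noindex` or `none` wins, whether it is general or addressed to the named crawler. Real crawlers may resolve conflicts differently.

```python
# Directive names that can appear before a colon without being a crawler name.
KNOWN_DIRECTIVES = {
    "all", "index", "follow", "noindex", "nofollow", "none", "noarchive",
    "nosnippet", "notranslate", "noimageindex", "unavailable_after",
    "max-snippet", "max-image-preview", "max-video-preview",
}


def split_directives(text):
    """Split a comma-separated directive list into a set of lowercase names."""
    return {item.strip().lower() for item in text.split(",") if item.strip()}


def directives_for(header_values, bot):
    """Collect the directives that apply to `bot` from all X-Robots-Tag values."""
    applicable = set()
    for value in header_values:
        head, sep, rest = value.partition(":")
        prefix = head.strip().lower()
        if sep and prefix not in KNOWN_DIRECTIVES:
            # The value is addressed to a named crawler, e.g. "GPTBot: noindex".
            if prefix == bot.lower():
                applicable |= split_directives(rest)
        else:
            # No crawler prefix: the value applies to everyone.
            applicable |= split_directives(value)
    return applicable


def may_index(header_values, bot):
    """Assumed rule: any applicable 'noindex' or 'none' blocks indexing."""
    return not ({"noindex", "none"} & directives_for(header_values, bot))


headers = ["index, follow", "GPTBot: noindex"]
print(may_index(headers, "GPTBot"))
print(may_index(headers, "Googlebot"))
```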
What to do right now
- Audit your sitemap. Remove dead URLs, add <lastmod> dates, and verify the sitemap loads at /sitemap.xml
- Review your robots.txt. Make sure you are not accidentally blocking pages that user-directed agents need
- Add the Sitemap directive to your robots.txt if it is not there
- Consider creating a separate sitemap for your most agent-relevant pages (products, services, support)
- Test by fetching your sitemap and robots.txt manually, and verify both parse correctly
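Part of this checklist can be automated. The Python sketch below is one possible starting point for the sitemap audit: it checks that the XML parses, that every entry has an absolute <loc> and a <lastmod>, and that no URL appears twice. The `audit_sitemap` function is a hypothetical helper, and the date check is a simplification that expects at least a full YYYY-MM-DD date. Checking for dead links requires requesting each URL and is left out so the example runs offline.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2026-02-28</lastmod></url>
  <url><loc>https://example.com/products</loc></url>
  <url><loc>https://example.com/</loc><lastmod>not-a-date</lastmod></url>
</urlset>"""


def audit_sitemap(xml_text):
    """Return a list of human-readable problems found in a sitemap."""
    try:
        root = ET.fromstring(xml_text.strip())
    except ET.ParseError as exc:
        return [f"sitemap is not well-formed XML: {exc}"]

    problems = []
    seen = set()
    for url in root.findall("sm:url", NS):
        loc = (url.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)

        if not loc.startswith(("http://", "https://")):
            problems.append(f"missing or relative <loc>: {loc!r}")
        if loc in seen:
            problems.append(f"duplicate URL: {loc}")
        seen.add(loc)

        if lastmod is None:
            problems.append(f"no <lastmod> for {loc}")
        else:
            try:
                # Only the date part is validated; a time suffix is ignored.
                date.fromisoformat(lastmod.strip()[:10])
            except ValueError:
                problems.append(f"unparseable <lastmod> for {loc}: {lastmod!r}")
    return problems


for problem in audit_sitemap(SAMPLE):
    print(problem)
```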
These files are often set up once and forgotten. With AI agents actively reading them, they deserve the same attention as any other part of your site infrastructure. Run an audit to check whether yours are properly configured.