robots.txt for AI crawlers, done right
Last updated: May 20, 2026
TL;DR
AI crawlers are not Googlebot: they use their own user-agents and read your /robots.txt independently. If you want to be cited by AI assistants, explicitly allow the AI bots on your public pages, keep your private paths disallowed, and make sure your CDN or WAF is not silently 403-ing them behind your back. Most “invisible to AI” sites lose here, on the boring basics.
Why robots.txt is the first thing that breaks
Before an assistant can summarize or cite a page, its crawler has to be allowed to fetch it. robots.txt is the very first checkpoint, and it is the single most common reason a site is invisible to AI — usually not from a deliberate block, but from a default rule, a migrated config, or a CDN setting nobody revisited. It is also the cheapest thing to get right.
The user-agents that matter
These are distinct bots with distinct user-agents — a rule for one does not cover another:
- GPTBot: OpenAI training/retrieval. See the full GPTBot reference.
- OAI-SearchBot / ChatGPT-User: ChatGPT search & user-triggered fetches.
- ClaudeBot (and anthropic-ai, Claude-Web): Anthropic.
- PerplexityBot: Perplexity Answers.
- Google-Extended: controls Gemini/Vertex use without affecting normal Google Search ranking.
- Bytespider: ByteDance/Doubao, mixed compliance reputation.
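To see that scoping concretely, here is a minimal sketch that uses Python's standard-library urllib.robotparser as a stand-in for a crawler's parser. The rules and the example.com URL are made up for illustration, and real crawlers may match user-agent tokens differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: only ClaudeBot has a group.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Map each user-agent token to whether it may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["ClaudeBot", "GPTBot", "PerplexityBot"],
    "https://example.com/pricing",
)
print(result)
```

Under this parser only ClaudeBot is refused; the other two tokens have no matching group and no wildcard group, so they are allowed.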
Directives that actually work
Keep it explicit. A permissive default plus per-bot groups that allow your public content and disallow private areas (dashboards, auth, internal endpoints) is the pattern that ages well. Two rules of thumb: Google-Extended is not Googlebot, so disallowing it does not hurt your search ranking and only opts you out of Gemini; and an empty Disallow: means “allow everything”, which is often what you want for marketing pages. Add a Sitemap: line so discovery is not guesswork.
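A sketch of that pattern, with the robots.txt held in a Python string so it can be checked with the standard-library parser. The paths, the example.com domain, and the choice of bots are assumptions for illustration, not a recommended policy. Disallow lines are listed before Allow: / so the result is the same under first-match parsers (like this one) and longest-match parsers.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Permissive default: an empty Disallow allows everything.
User-agent: *
Disallow:

# Explicit per-bot group: private areas blocked, public content allowed.
User-agent: GPTBot
Disallow: /dashboard/
Disallow: /auth/
Allow: /

# Opts out of Gemini/Vertex use; the Googlebot token is not in this group.
User-agent: Google-Extended
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def can(agent, path):
    """Ask the parser whether this agent token may fetch this path."""
    return parser.can_fetch(agent, "https://example.com" + path)


print(can("GPTBot", "/blog/post"))           # True: public content
print(can("GPTBot", "/dashboard/settings"))  # False: private area
print(can("Google-Extended", "/blog/post"))  # False: opted out
print(can("Googlebot", "/blog/post"))        # True: falls back to the * group
print(parser.site_maps())                    # ['https://example.com/sitemap.xml']
```

Note that site_maps() needs Python 3.8 or later.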
A note on Content-Signal
The contentsignals.org Content-Signal: line lets you express purpose-based consent (search vs. AI input vs. AI training) alongside the bot-by-bot rules. Adoption is early and some strict validators flag it as an “unknown directive”, but compliant crawlers ignore lines they do not understand (RFC 9309), so it is safe to include if it matches your policy.
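As a small illustration of "unknown lines are ignored", the sketch below parses the same rules with and without a Content-Signal line and compares the crawl decisions. The signal syntax shown (search, ai-input, ai-train with yes/no values) is an assumption based on the contentsignals.org proposal, so check the current spec before copying it. This only demonstrates the behaviour of Python's standard-library parser, not of any particular crawler.

```python
from urllib.robotparser import RobotFileParser

BASE = """\
User-agent: *
Disallow: /dashboard/
Allow: /
"""

# Assumed syntax; verify against contentsignals.org before using it.
WITH_SIGNAL = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Disallow: /dashboard/
Allow: /
"""

URLS = ["https://example.com/", "https://example.com/dashboard/"]


def decisions(robots_txt, urls, agent="GPTBot"):
    """Return the allow/deny decision for each URL, in order."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [parser.can_fetch(agent, url) for url in urls]


# The extra line changes nothing about what may be fetched.
print(decisions(BASE, URLS) == decisions(WITH_SIGNAL, URLS))  # True
```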
The silent trap: your CDN, not your file
Your robots.txt can be perfect and you can still be invisible. Cloudflare's Bot Fight Mode, WAF rules, and “block AI scrapers” toggles can return 403, 401, or 429 to AI user-agents before the request ever reaches your origin, so robots.txt never gets a say. Always test what the bots actually receive, post-CDN, not what your file says. If you would rather monetize than block, that is a different lever entirely: Bot Paywall charges specific bots per URL instead of refusing them.
Verify it the way a bot sees it
Reading your own robots.txt in a browser proves nothing — you are not GPTBot. Fetch the page as each AI user-agent and check the real status code. You can run a free audit that does exactly this (raw fetch, per-agent, post-CDN) and flags blocked bots by severity. Once crawlability is clean, the next layers are structure and an llms.txt — and the broader checklist is in getting cited by AI assistants.
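If you want to try a rough version of that check by hand, the following standard-library sketch fetches one URL once per user-agent and labels the status code. The agent list uses simplified tokens as an assumption; real crawlers send longer user-agent strings, and a WAF may key on those (or on IP ranges), so a pass here is not proof that the real bot gets through. The function names are hypothetical.

```python
import urllib.error
import urllib.request

# Simplified user-agent tokens, assumed for illustration only.
AI_AGENTS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot")


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends to this user-agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as exceptions; the code is what we want.
        return error.code


def classify(status):
    """Label a status code from a crawler's point of view."""
    if 200 <= status < 300:
        return "ok"
    if status in (401, 403, 429):
        return "blocked"
    return "check"


def audit(url, agents=AI_AGENTS, fetch=fetch_status):
    """Fetch url once per agent and label each result."""
    return {agent: classify(fetch(url, agent)) for agent in agents}
```

Calling audit("https://example.com/") with a page you own returns a dict such as one label per agent. The fetch parameter is injectable so the logic can be exercised without network access; connection failures (urllib.error.URLError) are deliberately left unhandled.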
FAQ
Does blocking GPTBot affect Google ranking?
No. GPTBot is OpenAI's crawler, entirely separate from Googlebot. Blocking it only affects OpenAI's crawling (ChatGPT search and user-triggered fetches use the separate OAI-SearchBot and ChatGPT-User agents); your Google ranking is unchanged.
Is one wildcard rule enough for all AI bots?
A User-agent: * group applies to bots without their own group, but being explicit per major bot is clearer, auditable, and avoids surprises when a vendor changes behaviour.
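One consequence worth seeing in a sketch: a bot that has its own group uses that group instead of the wildcard one, so private paths listed only under * are not inherited. The rules and paths below are hypothetical, and the check uses Python's standard-library parser as a stand-in for a crawler.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /internal/

User-agent: GPTBot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE_URL = "https://example.com"

# PerplexityBot has no group of its own, so the * group applies.
print(parser.can_fetch("PerplexityBot", BASE_URL + "/internal/report"))  # False

# GPTBot has its own group, so the * rules are not consulted for it.
print(parser.can_fetch("GPTBot", BASE_URL + "/internal/report"))  # True
print(parser.can_fetch("GPTBot", BASE_URL + "/drafts/post"))      # False
```

If /internal/ should be off-limits to GPTBot too, the Disallow has to be repeated inside the GPTBot group.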
robots.txt looks fine but I am still not cited — why?
Usually the CDN/WAF is blocking bots post-file, or the content only renders via JavaScript. Audit as the bot to see which layer fails.
Check your robots.txt against every AI bot: run a free audit or see the plans.