robots.txt for AI crawlers, done right

By The AI Visibility Checker team. Published May 20, 2026. 7 min read.

Last updated: June 1, 2026

TL;DR

AI crawlers are not Googlebot: they use their own user-agents and read your /robots.txt independently. If you want to be cited by AI assistants, explicitly allow the AI bots on your public pages, keep your private paths disallowed, and make sure your CDN or WAF is not silently returning 403s to them. Most “invisible to AI” sites lose here, on the boring basics.

Why robots.txt is the first thing that breaks

Before an assistant can summarize or cite a page, its crawler has to be allowed to fetch it. robots.txt is the very first checkpoint, and it is the single most common reason a site is invisible to AI — usually not from a deliberate block, but from a default rule, a migrated config, or a CDN setting nobody revisited. It is also the cheapest thing to get right.

The user-agents that matter

These are distinct bots with distinct user-agents — a rule for one does not cover another:

GPTBot: OpenAI's training crawler. Blocking it opts you out of future OpenAI model training but does not affect retrieval. See the full GPTBot reference.
OAI-SearchBot: populates ChatGPT's search index/cache. Blocking it makes you invisible to ChatGPT Search even if GPTBot is allowed.
ChatGPT-User: fetches a specific URL on behalf of a live chat. Blocking it stops grounding/verification. Roles in detail in ChatGPT Search SEO.
ClaudeBot (and the legacy anthropic-ai, Claude-Web): Anthropic's crawler.
PerplexityBot: Perplexity's indexing crawler.
Perplexity-User: Perplexity's live per-query fetch (analogous to ChatGPT-User). More on the split in Perplexity SEO.
Google-Extended: controls Gemini/Vertex training use without affecting normal Google Search ranking or AI Overviews inclusion.
Bytespider: ByteDance/Doubao, with a mixed reputation for robots.txt compliance.

Directives that actually work

Keep it explicit. A permissive default plus per-bot groups that allow your public content and disallow private areas (dashboards, auth, internal endpoints) is the pattern that ages well. Two rules of thumb. First, Google-Extended is not Googlebot: disallowing it does not hurt your search ranking, it only opts your content out of Gemini/Vertex training use. Second, an empty Disallow: means “allow everything”, which is often what you want for marketing pages. Add a Sitemap: line so discovery is not guesswork.
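As a rough illustration of that pattern, the sketch below embeds a sample robots.txt in a Python string and checks it with the standard library's urllib.robotparser. The domain, the private paths (/dashboard/, /auth/, /api/internal/), and the sitemap URL are placeholders, and blocking Google-Extended is just one possible policy choice, not a recommendation. Note that Python's parser applies the first matching rule in a group, while RFC 9309 crawlers use the longest match, so the Disallow lines are listed before Allow: / to behave the same under both.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy for example.com; every path and URL here is a placeholder.
ROBOTS_TXT = """\
# Permissive default for bots without their own group
User-agent: *
Disallow: /dashboard/
Disallow: /auth/
Disallow: /api/internal/
Allow: /

# Explicit group for the AI crawlers we want to admit
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Perplexity-User
Disallow: /dashboard/
Disallow: /auth/
Disallow: /api/internal/
Allow: /

# Optional: opt out of Gemini/Vertex training use only
User-agent: Google-Extended
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Public content is reachable, private areas are not.
public_ok = parser.can_fetch("GPTBot", "https://example.com/blog/post")
private_ok = parser.can_fetch("GPTBot", "https://example.com/dashboard/settings")
# Google-Extended has its own group, so the opt-out applies only to it.
extended_ok = parser.can_fetch("Google-Extended", "https://example.com/blog/post")
# A bot with no group of its own falls back to the "*" group.
fallback_ok = parser.can_fetch("SomeOtherBot", "https://example.com/auth/login")
sitemaps = parser.site_maps()

print(public_ok, private_ok, extended_ok, fallback_ok, sitemaps)
```

The standard-library parser is only a stand-in here: each vendor runs its own parser, so treat a passing check as a sanity test of the file, not as proof of how a given crawler behaves.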

A note on Content-Signal

The contentsignals.org Content-Signal: line lets you express purpose-based consent (search vs. AI input vs. AI training) alongside the bot-by-bot rules. Adoption is early and some strict validators flag it as an “unknown directive”, but compliant crawlers ignore lines they do not understand (RFC 9309), so it is safe to include if it matches your policy.
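To see the "ignored if not understood" behaviour concretely, here is a small sketch that parses the same rules with and without a Content-Signal line and compares the results. It uses Python's urllib.robotparser as a stand-in for a parser that does not know the directive; the signal values and the /dashboard/ path are illustrative assumptions, not a suggested policy.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; the signal values below are an example, not advice.
RULES = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Disallow: /dashboard/
Allow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Same file with the Content-Signal line stripped out.
stripped = "\n".join(
    line for line in RULES.splitlines() if not line.startswith("Content-Signal")
)

with_signal = build_parser(RULES)
without_signal = build_parser(stripped)

urls = ["https://example.com/", "https://example.com/dashboard/home"]
verdicts_with = [with_signal.can_fetch("GPTBot", u) for u in urls]
verdicts_without = [without_signal.can_fetch("GPTBot", u) for u in urls]

# A parser that skips unknown lines gives identical answers either way.
print(verdicts_with == verdicts_without)
```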

The silent trap: your CDN, not your file

Your robots.txt can be perfect and you can still be invisible. Cloudflare's Bot Fight Mode, WAF rules, and “block AI scrapers” toggles return 403/401/429 to AI user-agents before the request ever reaches your origin, so robots.txt never gets a say. Always test what the bots actually receive, post-CDN, not what your file says. If you would rather monetize than block, that is a different lever entirely: Bot Paywall charges specific bots per URL instead of refusing them.

Verify it the way a bot sees it

Reading your own robots.txt in a browser proves nothing — you are not GPTBot. Fetch the page as each AI user-agent and check the real status code. You can run a free audit that does exactly this (raw fetch, per-agent, post-CDN) and flags blocked bots by severity. Once crawlability is clean, the next layers are structure and an llms.txt — and the broader checklist is in getting cited by AI assistants.
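A minimal sketch of that per-agent check, using only the Python standard library, might look like the following. The function names (classify, status_as, audit) are hypothetical, and the user-agent values are bare product tokens: real crawlers send longer strings that contain these tokens, so this assumes your CDN or WAF matches on the token. Google-Extended is left out because it is a robots.txt control token rather than a crawler that fetches pages under its own user-agent.

```python
import urllib.error
import urllib.request

# Bare product tokens (an assumption): real crawler strings are longer.
AI_USER_AGENTS = [
    "GPTBot",
    "OAI-SearchBot",
    "ChatGPT-User",
    "ClaudeBot",
    "PerplexityBot",
    "Perplexity-User",
]


def classify(status: int) -> str:
    """Map an HTTP status to a rough verdict."""
    if 200 <= status < 300:
        return "ok"
    if status in (401, 403, 429):
        return "blocked"  # typical CDN/WAF refusals
    return "check"  # anything else deserves a manual look


def status_as(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Fetch url with the given User-Agent and return the final status code."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the code is what we want.
        return err.code


def audit(url: str, fetch=status_as) -> dict:
    """Return {bot: (status, verdict)} for every AI user-agent."""
    results = {}
    for bot in AI_USER_AGENTS:
        status = fetch(url, bot)
        results[bot] = (status, classify(status))
    return results
```

Calling audit("https://example.com/") against your own URL returns one status and verdict per bot. Because some CDNs also verify crawlers by IP address, a spoofed user-agent from your own machine can be treated differently from the real bot, so read the result as a first signal rather than a final answer.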

FAQ

Does blocking GPTBot affect Google ranking?

No. GPTBot is OpenAI's training crawler, entirely separate from Googlebot. Blocking it only opts you out of future OpenAI model training; your Google ranking is unchanged.

Is one wildcard rule enough for all AI bots?

A User-agent: * group applies to bots without their own group, but being explicit per major bot is clearer, auditable, and avoids surprises when a vendor changes behaviour.
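One surprise worth seeing in code: a bot that has its own group follows only that group and does not inherit the wildcard rules. The sketch below shows this with Python's urllib.robotparser, which matches RFC 9309 on this point; the /dashboard/ and /drafts/ paths are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: GPTBot gets its own group, everyone else uses "*".
ROBOTS_TXT = """\
User-agent: *
Disallow: /dashboard/

User-agent: GPTBot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot's own group says nothing about /dashboard/, so it is allowed there.
gptbot_dashboard = parser.can_fetch("GPTBot", "https://example.com/dashboard/")
gptbot_drafts = parser.can_fetch("GPTBot", "https://example.com/drafts/post")
# ClaudeBot has no group of its own and falls back to the wildcard rules.
claudebot_dashboard = parser.can_fetch("ClaudeBot", "https://example.com/dashboard/")

print(gptbot_dashboard, gptbot_drafts, claudebot_dashboard)
```

If a private path must stay closed to every bot, repeat its Disallow line in each per-bot group.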

robots.txt looks fine but I am still not cited — why?

Usually the CDN/WAF is blocking bots before robots.txt is ever consulted, or the content only renders via JavaScript. Audit as the bot to see which layer fails.

Check your robots.txt against every AI bot: run a free audit or see the plans.
