The file that controls what every major agent can see on your site is a plain text file at your domain root. Most brands have never looked at it.
Your Site's Front Door for Agents
robots.txt is served at yourdomain.com/robots.txt and formally standardized in RFC 9309[1]. The format rests on four directives: User-agent, Disallow, and Allow, defined in the RFC, plus the widely supported Sitemap extension[2]. Every well-behaved crawler checks this file before requesting anything else, which makes it the single most powerful access control mechanism on your site. It requires zero infrastructure to change. You can edit it in Notepad.
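A minimal file exercising all four directives might look like this (the paths and sitemap URL are placeholders):

```
# Applies to every crawler: block the admin area,
# but carve out one public subfolder inside it.
User-agent: *
Disallow: /admin/
Allow: /admin/public-help/

Sitemap: https://yourdomain.com/sitemap.xml
```

A later, more specific rule can re-open part of a disallowed tree, which is why Allow exists at all.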
The Agent Crawlers You Need to Know
Each agent platform sends its own User-agent string, and you need to make deliberate decisions about each one. Blocking GPTBot does not block ChatGPT-User, and vice versa. Here are the ones that matter today:
- GPTBot: OpenAI's training crawler[3]. Blocking it keeps your content out of future OpenAI training crawls, and therefore out of the knowledge baked into new ChatGPT models.
- ChatGPT-User: The real-time browsing crawler that ChatGPT uses when a user asks it to look something up. Blocking GPTBot but allowing this means ChatGPT can still browse your site live.
- ClaudeBot: Anthropic's crawler[4]. The same logic applies: blocking it means Claude cannot reference your content.
- Google-Extended: Google's opt-out token for Gemini, distinct from Googlebot, which handles organic search[5]. It is not a separate crawler; Googlebot does the fetching and honors the token. Blocking it lets you opt out of Gemini training without losing your search rankings.
- PerplexityBot: Perplexity's retrieval crawler. Increasingly active as Perplexity grows its answer engine.
- Applebot: Serves Apple Intelligence and Siri. Already crawling at scale.
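Putting the list above into practice, a file that makes a deliberate per-agent decision might look like this (the policy choices here are purely illustrative, not a recommendation):

```
# Keep content out of OpenAI training crawls...
User-agent: GPTBot
Disallow: /

# ...but let ChatGPT browse live on a user's behalf.
User-agent: ChatGPT-User
Allow: /

# Opt out of Gemini training without touching organic search.
User-agent: Google-Extended
Disallow: /

# Everyone else: open by default.
User-agent: *
Allow: /
```

Each group stands alone: a crawler obeys the most specific group that names it and ignores the rest, which is why blocking GPTBot says nothing about ChatGPT-User.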
Mistakes That Cost You Visibility
A blanket User-agent: * / Disallow: / block is the nuclear option, and it is more common than you would think. Brands deploy it during a site migration and forget to remove it afterward. Legacy groups written for bots that no longer exist can also catch new agent user-agents, because many parsers match User-agent tokens loosely, by prefix or substring, rather than as exact strings.
The other classic error is confusing Disallow with noindex. Disallow tells crawlers not to fetch a page; noindex, set in a meta tag or X-Robots-Tag header, tells search engines not to index it. They solve different problems, and they interact badly: a disallowed URL can still appear in search results if other sites link to it, while a noindex directive is never seen if the page is also disallowed, because the crawler cannot fetch the page to read the tag.
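The two mechanisms live in different places, which is the easiest way to keep them straight (paths here are placeholders):

```
# robots.txt — stops crawlers from fetching the page at all:
User-agent: *
Disallow: /internal-report/
```

```html
<!-- In the page's <head> — lets crawlers fetch it,
     but asks engines to keep it out of results: -->
<meta name="robots" content="noindex">
```

If you want a page out of search results, use noindex and leave it crawlable; if you want to save crawl budget or hide infrastructure, use Disallow.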
How to Get This Right
Open yourdomain.com/robots.txt right now and read it line by line. If you do not recognize every rule, investigate before changing anything. Write explicit User-agent blocks for each agent crawler you want to allow or restrict, and use specific path patterns rather than root-level blocks so you can permit product pages while restricting admin routes.
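A sketch of that structure, one explicit group per crawler with specific paths instead of a root-level block (the paths are hypothetical):

```
# Let GPTBot read the product catalog and nothing else.
User-agent: GPTBot
Allow: /products/
Disallow: /

# Keep every crawler out of admin and checkout routes.
User-agent: *
Disallow: /admin/
Disallow: /checkout/
```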
Test every change with a robots.txt validator, such as the robots.txt report in Google Search Console[6], before deploying. One misplaced wildcard can make your entire catalog invisible to agents overnight.
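You can also sanity-check rules locally before deploying. A quick sketch using Python's standard-library urllib.robotparser (the rules, crawler name, and URLs are hypothetical; note that this parser applies rules in file order rather than RFC 9309's longest-match, so keep Allow lines above broader Disallow lines):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: GPTBot may read the catalog, nothing else.
rules = """\
User-agent: GPTBot
Allow: /products/
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("GPTBot", "https://example.com/products/widget"))  # True
print(parser.can_fetch("GPTBot", "https://example.com/admin/login"))      # False
```

Running checks like this in CI catches a bad deploy before any crawler ever sees it.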
How Site Scanner Helps
Site Scanner checks whether major agent crawlers are blocked by your robots.txt, identifies the specific rule responsible, and flags accidental blocks that could be costing you visibility in agent-powered discovery.