The file that controls what every major agent can see on your site is a plain text file at your domain root. Most brands have never looked at it.
Your Site's Front Door for Agents
robots.txt is served at yourdomain.com/robots.txt and formally standardized in RFC 9309[1]. The format rests on four directives: User-agent, Disallow, and Allow, defined in the RFC, plus the widely supported Sitemap extension[2]. Every well-behaved crawler checks this file before requesting anything else, which makes it the single most powerful access control mechanism on your site. It requires zero infrastructure to change. You can edit it in Notepad.
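A minimal file exercising all four directives might look like this (the paths and sitemap URL are placeholders):

```
# Applies to every crawler: block the admin area,
# but carve out one public subfolder inside it.
User-agent: *
Disallow: /admin/
Allow: /admin/public-help/

Sitemap: https://yourdomain.com/sitemap.xml
```

A later, more specific rule can re-open part of a disallowed tree, which is why Allow exists at all.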
The Agent Crawlers You Need to Know
Each agent platform sends its own User-agent string, and you need to make deliberate decisions about each one. Blocking GPTBot does not block ChatGPT-User, and vice versa. Here are the ones that matter today:
- GPTBot: OpenAI's training crawler[3]. Blocking it keeps your content out of future OpenAI training crawls, and therefore out of the knowledge baked into new ChatGPT models.
- ChatGPT-User: The real-time browsing crawler that ChatGPT uses when a user asks it to look something up. Blocking GPTBot but allowing this means ChatGPT can still browse your site live.
- ClaudeBot: Anthropic's crawler[4]. The same logic applies: blocking it means Claude cannot reference your content.
- Google-Extended: Google's opt-out token for Gemini, distinct from Googlebot, which handles organic search[5]. It is not a separate crawler; Googlebot does the fetching and honors the token. Blocking it lets you opt out of Gemini training without losing your search rankings.
- PerplexityBot: Perplexity's retrieval crawler. Increasingly active as Perplexity grows its answer engine.
- Applebot: Serves Apple Intelligence and Siri. Already crawling at scale.
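Putting the list above into practice, a file that makes a deliberate per-agent decision might look like this (the policy choices here are purely illustrative, not a recommendation):

```
# Keep content out of OpenAI training crawls...
User-agent: GPTBot
Disallow: /

# ...but let ChatGPT browse live on a user's behalf.
User-agent: ChatGPT-User
Allow: /

# Opt out of Gemini training without touching organic search.
User-agent: Google-Extended
Disallow: /

# Everyone else: open by default.
User-agent: *
Allow: /
```

Each group stands alone: a crawler obeys the most specific group that names it and ignores the rest, which is why blocking GPTBot says nothing about ChatGPT-User.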
Mistakes That Cost You Visibility
A blanket User-agent: * / Disallow: / block is the nuclear option, and it is more common than you would think. Brands deploy it during a site migration and forget to remove it afterward. Legacy groups written for bots that no longer exist can also catch new agent user-agents, because many parsers match User-agent tokens loosely, by prefix or substring, rather than as exact strings.
The other classic error is confusing Disallow with noindex. Disallow tells crawlers not to fetch a page; noindex, set in a meta tag or X-Robots-Tag header, tells search engines not to index it. They solve different problems, and they interact badly: a disallowed URL can still appear in search results if other sites link to it, while a noindex directive is never seen if the page is also disallowed, because the crawler cannot fetch the page to read the tag.
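The two mechanisms live in different places, which is the easiest way to keep them straight (paths here are placeholders):

```
# robots.txt — stops crawlers from fetching the page at all:
User-agent: *
Disallow: /internal-report/
```

```html
<!-- In the page's <head> — lets crawlers fetch it,
     but asks engines to keep it out of results: -->
<meta name="robots" content="noindex">
```

If you want a page out of search results, use noindex and leave it crawlable; if you want to save crawl budget or hide infrastructure, use Disallow.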
How to Get This Right
Open yourdomain.com/robots.txt right now and read it line by line. If you do not recognize every rule, investigate before changing anything. Write explicit User-agent blocks for each agent crawler you want to allow or restrict, and use specific path patterns rather than root-level blocks so you can permit product pages while restricting admin routes.
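A sketch of that structure, one explicit group per crawler with specific paths instead of a root-level block (the paths are hypothetical):

```
# Let GPTBot read the product catalog and nothing else.
User-agent: GPTBot
Allow: /products/
Disallow: /

# Keep every crawler out of admin and checkout routes.
User-agent: *
Disallow: /admin/
Disallow: /checkout/
```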
Test every change with a robots.txt validator, such as the robots.txt report in Google Search Console[6], before deploying. One misplaced wildcard can make your entire catalog invisible to agents overnight.
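You can also sanity-check rules locally before deploying. A quick sketch using Python's standard-library urllib.robotparser (the rules, crawler name, and URLs are hypothetical; note that this parser applies rules in file order rather than RFC 9309's longest-match, so keep Allow lines above broader Disallow lines):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: GPTBot may read the catalog, nothing else.
rules = """\
User-agent: GPTBot
Allow: /products/
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("GPTBot", "https://example.com/products/widget"))  # True
print(parser.can_fetch("GPTBot", "https://example.com/admin/login"))      # False
```

Running checks like this in CI catches a bad deploy before any crawler ever sees it.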
How Site Scanner Helps
Site Scanner checks whether major agent crawlers are blocked by your robots.txt, identifies the specific rule responsible, and flags accidental blocks that could be costing you visibility in agent-powered discovery.