The next wave of web traffic doesn't come from humans clicking links. It comes from agent crawlers reading your site to answer questions and take action on behalf of users. If your site isn't readable by these crawlers, you don't exist in their world.
The Major Crawlers
At least five major agent crawlers are active today, each feeding a different system:
- GPTBot feeds ChatGPT and is the most aggressive by request volume[1]
- Googlebot feeds traditional search, while the Google-Extended robots.txt token controls whether that crawled content also feeds Gemini, making Google the only operator serving both paradigms[2]
- ClaudeBot feeds Anthropic's Claude and respects robots.txt directives strictly[3]
- PerplexityBot powers Perplexity's real-time answer engine and fetches pages on demand per user query[4]
- Applebot feeds Siri and Apple Intelligence across over 2 billion active Apple devices[5]
Why They're Not Search Crawlers
Search crawlers index pages for keyword retrieval. Agent crawlers extract content for comprehension and reasoning. That difference has a critical technical implication: most agent crawlers do not execute JavaScript. They are plain HTTP fetchers that need content in the initial HTML response. If your product catalog renders client-side through a React SPA, Googlebot might eventually index it, but GPTBot and ClaudeBot will see a blank page.
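The difference is easy to see with a toy comparison. The sketch below (hypothetical HTML, invented product name) contrasts what a plain HTTP fetcher sees in a server-rendered page versus a SPA shell: the fetcher reads only the raw markup, so anything JavaScript would inject later simply does not exist for it.

```python
# Hypothetical example: what a non-JS crawler "sees" is just the raw
# HTML string returned by the server.

SSR_HTML = """<html><body>
  <h1>Acme Widget</h1>
  <p class="price">$149</p>
</body></html>"""

SPA_SHELL = """<html><body>
  <div id="root"></div>  <!-- React fills this in after JS runs -->
  <script src="/bundle.js"></script>
</body></html>"""

def visible_to_plain_fetcher(html: str, needle: str) -> bool:
    """True only if the content is present in the initial HTML response."""
    return needle in html

print(visible_to_plain_fetcher(SSR_HTML, "Acme Widget"))   # True
print(visible_to_plain_fetcher(SPA_SHELL, "Acme Widget"))  # False
```

To a crawler that never executes the bundle, the SPA page contains an empty `<div>` and nothing else.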
Intent is the other key difference. A search crawler builds a static index, but an agent crawler reads your page to answer a specific question right now. Stale content, broken structured data, or slow responses don't just hurt your ranking over time. They cause an agent to give a wrong answer about your business today.
What Agents Prioritize
Agents optimize for speed and certainty. They want facts they can trust without interpretation. Four signals matter most:
- Structured data in JSON-LD is the highest-signal content on any page. A Product schema with price, availability, and ratings gives an agent exact facts. Without it, the agent parses raw HTML and guesses what "$149" refers to.
- Semantic HTML with a logical heading hierarchy lets agents navigate content quickly. An H1 followed by H2s and H3s in order communicates document structure. Skipping from H1 to H4 forces the agent to infer relationships.
- Response speed matters because agents operate under time budgets. If your page takes 4 seconds to respond, the agent may abandon it and pull from a competitor that responds in 200ms.
- llms.txt is an emerging proposal: a markdown file served at your site root that gives language models a curated summary of your site and links to its most important pages. Where robots.txt tells crawlers what they may fetch, llms.txt tells them what matters.
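As a concrete illustration of the first signal, here is what a minimal Product schema might look like in JSON-LD (the product name, price, and rating values are invented for the example). It would be embedded in the page inside a `<script type="application/ld+json">` tag:

```json
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Acme Widget",
  "offers": {
    "@type": "Offer",
    "price": "149.00",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock"
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.6",
    "reviewCount": "212"
  }
}
```

With this in place, an agent never has to guess whether "$149" is a price, a model number, or a discount threshold.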
Common Mistakes
The most damaging mistake is accidentally blocking agent crawlers in robots.txt. Many sites copied boilerplate rules years ago that block unknown user agents. GPTBot, ClaudeBot, and PerplexityBot all respect robots.txt, so one overzealous Disallow line can make your entire site invisible to agents while your competitors remain visible.
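You can audit this with a few lines of Python's standard library. The sketch below parses a hypothetical robots.txt containing exactly the overzealous boilerplate described above, then checks which crawlers it blocks:

```python
# Audit a robots.txt for accidental agent-crawler blocking using only
# the standard library. ROBOTS_TXT is a hypothetical example of
# boilerplate that allows Googlebot but catches everything else.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for bot in ["Googlebot", "GPTBot", "ClaudeBot", "PerplexityBot"]:
    allowed = parser.can_fetch(bot, "https://example.com/products")
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

Googlebot matches its explicit group and is allowed; the three agent crawlers fall through to the catch-all `Disallow: /` and are blocked, which is precisely the invisible-to-agents failure mode.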
JavaScript-rendered content is the second biggest gap. If your content loads via client-side fetch calls, agent crawlers see empty containers. Server-side rendering or static generation closes this gap by putting the complete content in the initial HTML response.
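As a minimal sketch of the server-side option (illustrative names, with Python's stdlib templating standing in for a real framework), the product data is interpolated into the HTML before the response is sent, so even a crawler that never runs JavaScript sees finished content:

```python
# Minimal server-side rendering sketch: the page is assembled on the
# server, so the initial HTTP response already contains the content.
# Names and values are hypothetical.
from string import Template

PAGE = Template("""<html><body>
  <h1>$name</h1>
  <p class="price">$price</p>
</body></html>""")

def render_product_page(name: str, price: str) -> str:
    """Return complete HTML with product data already filled in."""
    return PAGE.substitute(name=name, price=price)

html = render_product_page("Acme Widget", "$149")
print(html)
```

In a real stack this role is played by SSR frameworks or static site generators, but the principle is the same: the content must exist in the bytes the server sends.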
Missing structured data rounds out the list. Without it, agents guess, and guessing means they sometimes get it wrong. A competitor with proper Product schema will get cited accurately while your products get paraphrased or skipped.
How Site Scanner Helps
Site Scanner evaluates your site across the same signals agent crawlers use: structured data presence and validity, content rendering method, robots.txt rules, and page response times.