How to Block AI Crawlers with Robots.txt (GPTBot, CCBot, and More)


AI companies are scraping the web at industrial scale to train their models. If you run a website, your content is almost certainly being consumed by bots like GPTBot (OpenAI), CCBot (Common Crawl), and Google-Extended — often without your knowledge or consent.

The good news: robots.txt gives you a straightforward way to block them. Here's exactly how to do it.

Why AI Crawlers Are Different from Search Engine Bots

Googlebot and Bingbot crawl your site to index it and send you traffic. That's a fair trade — they read your content, and in return, people find your site through search results.

AI crawlers don't work that way. They read your content to train language models. You get nothing back. No traffic, no attribution, no compensation. Your writing becomes training data for a chatbot that may eventually compete with your site for the same audience.

That's why many publishers — from the New York Times to small blog owners — have started blocking these bots.

The Complete List of AI Crawlers to Block

Here are the major AI crawlers active in 2026, along with who operates them:

OpenAI:

  • GPTBot — used to train GPT models
  • OAI-SearchBot — powers ChatGPT search (blocking this removes you from ChatGPT search results)
  • ChatGPT-User — real-time browsing when a ChatGPT user asks it to visit a URL

Google:

  • Google-Extended — trains Gemini and other Google AI products (separate from Googlebot, so blocking it won't affect your Google Search rankings)

Common Crawl:

  • CCBot — nonprofit that builds massive web datasets used by dozens of AI companies

Anthropic:

  • ClaudeBot — Anthropic's web crawler, used to gather training data for Claude models
  • anthropic-ai — older Anthropic user-agent token; still worth blocking for completeness

Meta:

  • FacebookBot — used for AI training (separate from the crawler that generates link previews)
  • Meta-ExternalAgent — Meta's dedicated AI training crawler

Other:

  • Bytespider — ByteDance (TikTok parent company)
  • cohere-ai — Cohere language models
  • PerplexityBot — Perplexity AI search
  • Applebot-Extended — Apple's AI training crawler (separate from regular Applebot)

The Robots.txt Rules — Copy and Paste

Add this block to your robots.txt file. If you already have one, append these groups to the end of the file — each group applies only to the bot it names, so your existing rules for other crawlers are unaffected:

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

Each User-agent line targets a specific bot. Disallow: / tells it "don't crawl anything on this site."
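You can check this behavior locally before deploying. A minimal sketch using Python's standard-library robots.txt parser (the single GPTBot group below stands in for the full block above):

```python
# Parse a robots.txt snippet and check which agents may fetch a URL.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# GPTBot is blocked everywhere on the site...
print(rp.can_fetch("GPTBot", "https://example.com/any-page"))    # False
# ...while an agent with no matching group falls through to "allow".
print(rp.can_fetch("Googlebot", "https://example.com/any-page"))  # True
```

The second result is the key point: with no wildcard `User-agent: *` group, crawlers you didn't name are unaffected.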

You can generate a properly formatted robots.txt with these rules using our Robots.txt Generator — just toggle on the AI crawler blocking options.

Selective Blocking: Allow Search, Block AI

Most site owners want to keep Googlebot and Bingbot (they send you traffic) while blocking AI crawlers (they don't). The rules above already handle this correctly because each User-agent directive only applies to the named bot.

Your regular search engine crawlers are unaffected. Googlebot and Google-Extended are completely separate user agents. Blocking Google-Extended has zero impact on your Google Search rankings or indexing.

Same with Applebot vs Applebot-Extended, and Facebook's link preview crawler vs FacebookBot.

Partial Blocking: Protect Some Content, Share the Rest

Maybe you want AI bots to access your marketing pages but not your blog content or premium articles. You can scope the Disallow rules:

User-agent: GPTBot
Disallow: /blog/
Disallow: /premium/
Disallow: /members/

User-agent: CCBot
Disallow: /blog/
Disallow: /premium/
Disallow: /members/

This lets AI crawlers see your homepage and product pages while keeping your original content protected.
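Path-scoped rules are easy to get subtly wrong (a missing trailing slash changes the match), so it's worth verifying them. A small sketch with the standard-library parser, using the example paths above plus a hypothetical /pricing page:

```python
# Verify that path-scoped Disallow rules block only the intended paths.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: GPTBot
Disallow: /blog/
Disallow: /premium/
Disallow: /members/
""".splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/blog/my-post"))  # False
print(rp.can_fetch("GPTBot", "https://example.com/pricing"))       # True
```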

What Robots.txt Can't Do

Be realistic about the limitations:

  1. It's voluntary. Robots.txt is a convention, not a firewall. Well-behaved bots from major companies respect it. Rogue scrapers won't. If a crawler ignores your robots.txt, you'd need server-level blocking (IP bans, WAF rules, or rate limiting).

  2. It's not retroactive. If GPTBot already crawled your site last year, that data is already in OpenAI's training set. Blocking it now only prevents future crawling.

  3. New bots appear constantly. The list above covers the major players as of early 2026, but new AI companies and new crawler user agents pop up regularly. Check your server access logs periodically for unfamiliar bot traffic.

  4. It won't block everything from a company. Some AI training uses data from Common Crawl's existing archives rather than live crawling. Blocking CCBot today doesn't remove your site from datasets already collected.

How to Verify It's Working

After updating your robots.txt:

  1. Check the file is accessible. Visit yourdomain.com/robots.txt in a browser and confirm the rules appear correctly.
  2. Use Google Search Console. The robots.txt report shows the version of your file Google last fetched and how it was parsed — useful for catching syntax errors.
  3. Monitor server logs. Check your access logs for the blocked user agents. If they're still hitting your pages, something's wrong with your rules or they're ignoring robots.txt entirely.
  4. Test with our tool. Paste your robots.txt into the Robots.txt Generator to validate the syntax.
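Step 3 is easy to script. A minimal sketch that counts hits per crawler in a combined-format access log (the agent list and sample log lines are illustrative — adjust both for your site):

```python
# Count access-log hits per AI-crawler user agent. Assumes a standard
# combined log format where the User-Agent is the final quoted field;
# a plain substring scan is enough for a quick audit.
from collections import Counter

AGENTS = ["GPTBot", "CCBot", "ClaudeBot", "Bytespider", "PerplexityBot"]

def count_bot_hits(log_lines):
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for agent in AGENTS:
            if agent.lower() in lowered:
                hits[agent] += 1
    return hits

sample = [
    '1.2.3.4 - - [10/Jan/2026:00:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '5.6.7.8 - - [10/Jan/2026:00:00:01 +0000] "GET / HTTP/1.1" 200 128 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
print(count_bot_hits(sample))  # Counter({'GPTBot': 1})
```

If a blocked agent keeps appearing in the counts weeks after you deploy the rules, either the rules have a syntax problem or the bot is ignoring robots.txt.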

Should You Block AI Crawlers?

This isn't a clear-cut decision for everyone.

Block them if:

  • Your content is your product (articles, guides, courses, creative writing)
  • You're concerned about AI-generated competitors using your material
  • You want to take a principled stance on consent and attribution

Consider keeping them if:

  • You want visibility in AI-powered search tools (ChatGPT search, Perplexity)
  • Your site benefits from being referenced in AI responses
  • You're running a tool or service site where the value isn't in the text content

Many site owners take a middle path — blocking training crawlers like GPTBot and CCBot while allowing search-adjacent bots like OAI-SearchBot and PerplexityBot that could drive traffic.

Set It Up in 30 Seconds

You don't need to manually write these rules. Our Robots.txt Generator lets you toggle AI crawler blocking with a few clicks and generates a clean, valid file ready to upload.

If you're also working on your site's search presence, make sure your meta tags and XML sitemap are in order — blocking AI bots is one piece of a solid technical SEO setup.

Ready to try it?

Build a robots.txt file for your website. Control which pages search engines can crawl with an easy visual editor.

🤖 Robots.txt Generator — Free Online Tool
