Block AI Crawlers on Cloudflare Without Losing Citations

The other day I opened my own robots.txt just to look. I had no plan to change anything; I only wanted to see which bots were currently allowed in. As I read down the list, one line stopped me: ClaudeBot was blocked, the one bot in there I actually wanted to let in.

ClaudeBot is the crawler Claude sends to read your site when someone asks it something and it needs to fetch a real source to cite. Blocking it is the same as saying: when someone asks Claude about what I wrote, don't reach for me. And the reason that line was there at all was Cloudflare. It had set it automatically. There is a single switch that blocks every AI bot at once, and ClaudeBot had been swept up with the rest.

Most people's first reaction to AI bots is to wall them all off, just to be safe. It feels prudent. But look closer and "take it to train on" and "cite it live" turn out to be completely different jobs, and blocking everything at once shuts two doors when you only meant to shut one.

The terms, all in one place

robots.txt: the sign on your front gate telling bots whether they may enter. It's only a request; polite bots obey it.
ClaudeBot: Anthropic's crawler. It reads your site when Claude needs to fetch a real source to cite for someone.
GPTBot: OpenAI's crawler that collects content to train the model. It is not what makes ChatGPT cite you live.
OAI-SearchBot: the one that lets ChatGPT pull you into a live answer. A different bot from GPTBot.
Google-Extended: a switch for whether Google may train Gemini on your content. Nothing to do with your Google Search ranking.
AI Crawl Control / WAF: Cloudflare's enforced layer at the network edge. It can block a bot even if the bot ignores robots.txt.
managed robots.txt: a Cloudflare feature that writes robots.txt for you and blocks the whole set of AI bots in one click.

Part 1: The line that stopped me in robots.txt

ClaudeBot wasn't blocked because I told it to be. Cloudflare has a feature that protects your site from AI bots; flip one switch and it writes robots.txt for you, adding a Disallow for around nine AI bots at once. Most people leave it on, because "block AI bots" sounds like the responsible thing to do, without going down the list to check whether one they'd want to keep is mixed in.

Going through it name by name changed the picture. These bots aren't doing the same work. Some come to collect content to train a model; some come to read so they can cite you when someone asks. Lumping them into one block and shutting it all at once means giving up the second to stop the first, when the two never had to be traded off against each other.

That's where it made me stop. The right question isn't "should I block AI bots." It's "which ones do I want to find me, and which ones are just stopping by to take things for free." Those two questions lead to completely different settings.

Part 2: "Train on" and "cite live" are different things

The AI bots that read your site split roughly into two kinds by the work they do, and that line is what makes the whole thing simple.

The first kind are training bots. They collect your content to teach the next version of a model. OpenAI's GPTBot, Google-Extended, Common Crawl's CCBot, and ByteDance's Bytespider live here. You can block these freely if you don't want your own writing scraped as free training data, and crucially, blocking them barely touches whether you get found, because finding you was never their job.

The second kind are live-citation bots: the ones that make you get found and cited when someone asks. ClaudeBot is Claude's. OAI-SearchBot is the one that lets ChatGPT pull you into a live answer. Googlebot is what puts you in normal Google results. Block these and you vanish from their answers right away.

The mix-up people make most is treating the names as interchangeable. Blocking GPTBot does not make ChatGPT stop citing you, because the bot behind live ChatGPT citations is OAI-SearchBot, a different one. Same with Google-Extended: it's only a Gemini-training switch, so turning it off leaves Google Search seeing you exactly as before. Once you separate those two, the equation gets easy: block the training bots and you protect your work from being used for free at almost no cost to being found. And the live-citation bot most worth keeping open right now is ClaudeBot.
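One way to see that the names are not interchangeable is to hand a robots.txt to a parser and ask about each bot separately. The sketch below uses Python's standard urllib.robotparser. The file content and the example.com URL are illustrative assumptions, and a real crawler may parse the file differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks only two training tokens.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def allowed(robots_txt: str, bot: str, url: str = "https://example.com/post") -> bool:
    """Return True if `bot` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


# Ask about each name on its own: a rule for GPTBot says nothing
# about OAI-SearchBot, and Google-Extended says nothing about Googlebot.
for bot in ("GPTBot", "OAI-SearchBot", "Google-Extended", "Googlebot"):
    print(f"{bot:16} allowed={allowed(ROBOTS_TXT, bot)}")
```

Under this parser the two training tokens come back blocked while OAI-SearchBot and Googlebot stay allowed, because no group in the file names them.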

One more layer to understand: robots.txt is only a request for cooperation. Polite bots like ClaudeBot and GPTBot read it and obey. Bots that ignore the sign exist too, and to actually stop those you need an enforced layer like AI Crawl Control or a WAF rule that answers with a 403. The two layers work at different levels: the sign talks to bots that listen; enforcement handles the ones that don't.
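The difference between the two layers can be sketched in a few lines. This is not how Cloudflare implements AI Crawl Control or WAF rules; it is a toy model of "enforcement", with a hypothetical blocklist and made-up user-agent strings, to show that the decision happens on the server side of every request and never asks the bot to cooperate.

```python
# Hypothetical user-agent substrings to refuse at the edge.
ENFORCED_BLOCKLIST = ("Bytespider", "CCBot")


def edge_status(user_agent: str, blocklist=ENFORCED_BLOCKLIST) -> int:
    """Return the HTTP status an enforcing layer would answer with.

    robots.txt is never consulted here: the request is refused or served
    based only on who is asking.
    """
    ua = user_agent.lower()
    if any(token.lower() in ua for token in blocklist):
        return 403  # refused whether or not the bot read robots.txt
    return 200


# Made-up user-agent strings for illustration.
print(edge_status("Mozilla/5.0 (compatible; Bytespider)"))
print(edge_status("Mozilla/5.0 (compatible; ClaudeBot/1.0)"))
```

User-agent strings can be spoofed, so a real edge product may rely on more than a substring match. The sketch only illustrates where the decision is made.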

Part 3: Two Cloudflare features fight you

So I knew I wanted to let ClaudeBot in and block the training bots. Sounds like flipping a switch. But doing it for real, I found two Cloudflare features working against each other.

First, the managed robots.txt that was on is all-or-nothing. It won't let you allow one bot at a time: block the AI bots and you block the whole set, and ClaudeBot can't slip out on its own. Second, when I went into AI Crawl Control and set ClaudeBot to Allow, I assumed that was it. But that button only governs the enforced layer (whether to answer with a 403). It does not remove ClaudeBot's Disallow line from the robots.txt that Cloudflare wrote. So polite ClaudeBot still comes, reads robots.txt, sees Disallow, and stays away anyway.

The trap that hurts more: if you write your own robots.txt over the top while managed is still on, Cloudflare prepends its block in front of your file. Now there are two ClaudeBot groups in one file, the top one saying Disallow, the lower one saying Allow, and a bot that acts on the first group it meets never reaches your Allow. You changed it and nothing changed.
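The duplicate-group trap is easy to reproduce offline. The merged file below is a hypothetical reconstruction of the situation (a managed block on top, hand-written rules underneath), not Cloudflare's actual output. Python's urllib.robotparser happens to act on the first matching group, so it shows the failure clearly. Other parsers may resolve it differently: RFC 9309 says groups with the same user-agent should be combined.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical merged file: a prepended Disallow group, then your own Allow group.
MERGED = """\
User-agent: ClaudeBot
Disallow: /

User-agent: ClaudeBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(MERGED.splitlines())

# The parser stops at the first group that names ClaudeBot,
# so the Allow group further down is never consulted.
claudebot_ok = parser.can_fetch("ClaudeBot", "https://example.com/")
print("ClaudeBot allowed:", claudebot_ok)
```

Even if some crawlers would merge the two groups and end up allowed, a file whose outcome depends on the parser is not one you control, which is why the duplicate has to go.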

The fix that actually works is two things done together. Write your own robots.txt at the edge (let everyone in as normal, add an Allow for ClaudeBot, add a per-bot Disallow for the training bots, close with a Sitemap line) and turn the managed robots.txt switch off. One without the other isn't enough; they have to be paired, so the robots you wrote is the only voice the bot hears.
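The shape of that file can be sketched with a small generator. The bot lists follow the split described in this post, and the sitemap URL is a placeholder. Both are assumptions to adjust for your own site, and this is not the exact file running on any real domain.

```python
# Lists follow this post's split; edit them to match your own decisions.
ALLOW_BOTS = ["ClaudeBot"]
TRAINING_BOTS = ["GPTBot", "Google-Extended", "CCBot", "Bytespider"]
SITEMAP_URL = "https://example.com/sitemap.xml"  # hypothetical placeholder


def build_robots_txt(allow, block, sitemap):
    """Build a robots.txt with exactly one group per named bot."""
    groups = ["User-agent: *\nAllow: /"]  # everyone else comes in as normal
    groups += [f"User-agent: {bot}\nAllow: /" for bot in allow]
    groups += [f"User-agent: {bot}\nDisallow: /" for bot in block]
    groups.append(f"Sitemap: {sitemap}")
    return "\n\n".join(groups) + "\n"


robots_txt = build_robots_txt(ALLOW_BOTS, TRAINING_BOTS, SITEMAP_URL)
print(robots_txt)
```

Generating the file from two explicit lists keeps each bot in exactly one group, which is what the managed switch being off is supposed to guarantee on the live site.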

Proving it worked is easy and leaves nothing to guess. Open yoursite/robots.txt and read it. You should see one ClaudeBot group with no Disallow under it, and the training bots each listed with Disallow. Then you know who's allowed and who isn't because you decided, not because you left the default running. The edge file itself and the full per-bot list are the part I'm turning into a repeatable set I can drop in again; the principle and the two-layer fix above are the part that matters.
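Reading the file by eye works, but the same check can be written down so it is repeatable. The sketch below is a deliberately small parser of my own, not a full RFC 9309 implementation. The function names are hypothetical, and it checks only the two things described above: one group for the bot you want, and no Disallow inside it.

```python
from collections import defaultdict


def audit_robots(text: str) -> dict:
    """Map each user-agent (lowercased) to the rule lists of every group naming it."""
    groups = defaultdict(list)
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # rules already seen, so this line opens a new group
                for agent in agents:
                    groups[agent].append(rules)
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    for agent in agents:  # store the final group
        groups[agent].append(rules)
    return dict(groups)


def problems(text: str, wanted: str = "ClaudeBot") -> list:
    """List reasons the wanted bot might still be kept out."""
    found = audit_robots(text).get(wanted.lower(), [])
    issues = []
    if len(found) != 1:
        issues.append(f"{wanted}: expected 1 group, found {len(found)}")
    if any(f == "disallow" and v for rules in found for f, v in rules):
        issues.append(f"{wanted}: a Disallow rule is still present")
    return issues


sample = "User-agent: ClaudeBot\nDisallow: /\n\nUser-agent: ClaudeBot\nAllow: /\n"
print(problems(sample))
```

To run it against a live site, the text could come from `urllib.request.urlopen("https://yoursite/robots.txt").read().decode()`; an empty list back means both checks passed.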

Back to the same file: opening robots.txt again after the fix, it feels different. Not because the bot count changed, but because every line in it is something I chose. ClaudeBot can come cite me, the training bots are kept out, anyone who asks Claude about what I wrote has a shot at getting me as the answer, and the work I put in to write it isn't scooped up to train on for nothing.

If you do one thing today, open your own domain with /robots.txt on the end and read it line by line, asking whether you actually want each bot in. Most people find they never got to choose. The whole question is one sentence: which ones do you want to find you, and which are just stopping by to take things for free.

Sources: Every step comes from the real configuration on productize.life (June 2026); you can open productize.life/robots.txt yourself. The GPTBot (training) vs OAI-SearchBot (live citation) distinction per OpenAI bots docs; Google-Extended per Google crawlers docs; the managed-robots vs AI Crawl Control trap per Cloudflare AI Crawl Control docs.

In the same series

This piece is about controlling who gets to read your site. The other side is getting found by AI in the first place. Start with "Your site is live, but who can see it: getting found by AI".

The other side is making your site usable by AI agents, not just findable: "Make your site an agent-ready website".

The flip side is keeping your own data from leaking out of your machine: "AI data privacy, a three-layer defense".

The outbound side of the same coin is giving your own site an AI that answers readers, free, with Workers AI and no backend: "Your static site can have AI".

This is one layer of the full production AI agent architecture (7 layers).

