The New Default Is Closed
For most of the web's history, crawling was permissionless. A bot showed up, read your pages, and the only thing standing in its way was a robots.txt file that politely asked it to behave. That model is ending. The infrastructure layer the web runs on is shifting from open by default to closed by default, and the change is happening fast enough that many brands have already had their AI crawler policy decided for them without ever making a choice.
The clearest example is Cloudflare, which sits in front of a large share of the web. As of July 2025, new websites set up through Cloudflare block all AI crawlers by default, and site owners must actively grant specific bots permission to access content. That is a reversal of the old assumption. Where the web used to be open unless you closed it, a meaningful slice of it is now closed unless you open it. If you launched a site on Cloudflare since that change and never touched your bot settings, your pages may be invisible to AI systems right now.
The core risk. A default block does not feel like a decision, but it is one. Brands that want to be found in AI search can disappear from it without ever clicking a single setting.
This matters because being cited by AI systems has become a real channel, not a curiosity. AI Overviews, AI Mode, and conversational assistants answer questions by retrieving from live web content, and the pages they retrieve from get the visibility. A blanket block aimed at protecting your content from AI training can quietly cut you out of that channel too. The rest of this piece is about telling those two things apart and making a deliberate choice instead of inheriting a default.
What Cloudflare Pay per Crawl Actually Does
Cloudflare launched Pay per Crawl, a marketplace built to give publishers control over AI crawling and a way to monetize it. The premise is that if AI companies are going to read your content to build products, you should be able to decide the terms. For each AI crawler, a publisher can make one of three choices: allow access for free, charge a per-crawl fee, or block access entirely. The middle option, paid access, is the genuinely new piece. It turns crawling from something that just happens into something with a price tag attached.
The mechanics are handled in the middle. Cloudflare sits between the publisher and the AI company, handling payment processing and distributing revenue to publishers. A site owner does not have to negotiate contracts with each AI vendor or build billing infrastructure. They set a policy, and the marketplace enforces and settles it. Pay per Crawl is currently in private beta, so it is not yet a universal switch, but the direction it points is the important part. The plumbing for a paid, permissioned crawl economy is being built.
There is a disclosure requirement baked into the model that does a lot of the useful work. AI companies must disclose whether they are crawling content for training purposes, for live search responses, or for other uses. Without that, every AI bot would look the same and your only real options would be all-or-nothing. With it, you can treat a crawler that feeds live answers differently from one that feeds a training run. That distinction is the whole game, and it is where most brands get the decision wrong.
Training Crawlers vs Retrieval Crawlers
There are two fundamentally different reasons an AI system fetches your page, and conflating them is the most expensive mistake in this whole topic. A training crawler reads your content to help build or refine a model. The model absorbs patterns from your text, and your specific page is not directly surfaced to a user later. A retrieval crawler, by contrast, fetches your page to answer a live question right now. When an assistant answers a question and links a source, or when an AI Overview quotes a passage, a retrieval crawler went and got that page in real time.
The visibility consequences are opposite. Blocking training crawlers costs you nothing in terms of being cited, because training crawlers do not produce citations. Blocking retrieval crawlers costs you everything in terms of being cited, because retrieval is the mechanism by which your content shows up in AI answers at all. A brand that blocks both because both have "AI" in the name has thrown away its AI search presence to protect against a thing that was never going to surface its pages anyway.
This is exactly why the disclosure requirement matters and why a thoughtful policy is possible. You want to keep retrieval crawlers open so your content stays eligible to be quoted and linked. You can then make a separate, calmer decision about training access, where the trade-offs are about content rights and monetization rather than visibility. If you want the deeper mechanics of how AI systems decide which retrieved pages to actually cite, our guide on how Google AI Overviews choose sources walks through the selection logic.
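To make the split concrete, here is a minimal sketch of a lookup from user-agent token to crawl purpose. The names `Purpose`, `KNOWN_AGENTS`, and `classify` are hypothetical, and the token-to-purpose mapping is an assumption based on our reading of vendor documentation. It is not a complete or authoritative registry, so verify each entry against the vendor's current docs before relying on it.

```python
from enum import Enum


class Purpose(Enum):
    TRAINING = "training"
    RETRIEVAL = "retrieval"
    UNKNOWN = "unknown"


# Assumption: illustrative mapping only. Vendors add, rename, and
# repurpose crawlers, so treat this table as a starting point.
KNOWN_AGENTS = {
    "GPTBot": Purpose.TRAINING,
    "ClaudeBot": Purpose.TRAINING,
    "CCBot": Purpose.TRAINING,
    "OAI-SearchBot": Purpose.RETRIEVAL,
    "ChatGPT-User": Purpose.RETRIEVAL,
    "Claude-SearchBot": Purpose.RETRIEVAL,
    "Claude-User": Purpose.RETRIEVAL,
    "PerplexityBot": Purpose.RETRIEVAL,
}


def classify(user_agent: str) -> Purpose:
    """Return the assumed purpose for a raw User-Agent header value."""
    ua = user_agent.lower()
    for token, purpose in KNOWN_AGENTS.items():
        if token.lower() in ua:
            return purpose
    return Purpose.UNKNOWN


if __name__ == "__main__":
    sample = "Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot)"
    print(classify(sample).value)
```

A user-agent string is only a claim and can be spoofed, so a real audit would also verify the source, for example against published IP ranges where a vendor provides them.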
How Brands Go Invisible by Accident
The accidental-invisibility problem is not hypothetical, and it does not require any malice or even any active mistake. It happens through inheritance. A marketing team spins up a new property on Cloudflare. The default blocks all AI crawlers. Nobody on the team thinks of AI crawler permissions as a launch checklist item, because a year ago they did not exist. The site goes live, ranks fine in traditional search, and quietly never appears in a single AI answer because the retrieval crawlers were turned away at the edge before they ever reached the page.
What makes this hard to catch is that the symptom is an absence. There is no error message for not being cited. Your analytics show traffic from classic search, your pages render correctly, and everything looks healthy. The gap only shows up if you go looking for it, by checking whether your brand appears when an assistant answers questions in your category, or by auditing which bots your edge configuration actually permits. Most teams never run that check, so the loss compounds silently.
The fix is to treat AI crawler access as a deliberate configuration you review, the same way you would review your robots.txt or your canonical tags. Audit which AI bots reach your origin, confirm the retrieval crawlers are allowed, and document the policy so it survives the next site migration or platform change. If you want a structured pass over your whole setup, our SEO audit service includes an AI crawler accessibility review, and our AIO readiness checker gives you a fast first read on whether your site is built to be found by AI.
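The robots.txt part of that audit can be scripted with Python's standard `urllib.robotparser`. The sketch below is a minimal example under stated assumptions: `SAMPLE_ROBOTS_TXT` is an invented file, the agent names are illustrative, and `audit_robots` is a hypothetical helper. In practice you would feed it the robots.txt your site actually serves.

```python
from urllib.robotparser import RobotFileParser

# Assumption: an invented robots.txt that blocks one training crawler
# and leaves everything else open.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def audit_robots(robots_txt: str, agents: list[str],
                 url: str = "https://example.com/") -> dict[str, bool]:
    """Report, per agent token, whether robots.txt permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]
    for agent, allowed in audit_robots(SAMPLE_ROBOTS_TXT, agents).items():
        print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

A passing result here only covers the robots.txt layer. A bot that robots.txt permits can still be refused by an edge rule, so this check complements a review of your edge configuration rather than replacing it.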
Why Network-Level Control Changes the Stakes
Part of why this moment is different comes down to where the control lives. The Cloudflare default and Pay per Crawl are network-level controls, which are stronger than robots.txt because a bot can simply ignore robots.txt. A robots.txt directive is a sign on the door, and a poorly behaved crawler can read it and walk in anyway. There has never been real enforcement behind it, only convention. Network-level controls operate at the edge, before a request reaches your origin server, so a blocked bot is actually stopped rather than asked nicely.
That strength is genuinely useful when you want to block or monetize, because it means your policy has teeth. If you charge for crawl access, the payment is enforced rather than requested. If you block a training crawler, it is actually blocked. For publishers who have watched their archives get scraped against their stated wishes for years, this is the first time the stated wishes carry weight. The shift from honor system to enforcement is real and it favors content owners.
The same strength is what makes the accidental block so damaging. Under the old robots.txt regime, a misconfiguration was soft. A retrieval crawler that your rules had accidentally disallowed might still get through, which was sloppy but sometimes saved you from your own mistakes. Under network-level enforcement, there is no slack. If your edge config turns away the retrieval crawler, it is gone, full stop. Stronger controls mean both your intended policies and your unintended ones execute exactly as configured, which raises the cost of not paying attention.
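The difference between the two regimes can be shown with a toy model. This is a simplified sketch, not a description of how Cloudflare implements enforcement: `Crawler`, `robots_txt_outcome`, and `edge_outcome` are hypothetical names, the `honors_robots_txt` and `will_pay` flags are invented for illustration, and the HTTP status codes (200, 402, 403) are an assumption chosen because they are the standard codes for success, payment required, and forbidden.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    CHARGE = "charge"
    BLOCK = "block"


@dataclass(frozen=True)
class Crawler:
    name: str
    honors_robots_txt: bool  # hypothetical: does the bot obey robots.txt?
    will_pay: bool = False   # hypothetical: does it accept a crawl fee?


def robots_txt_outcome(crawler: Crawler, disallowed: bool) -> str:
    """Advisory model: a rule only works if the crawler chooses to obey it."""
    if disallowed and crawler.honors_robots_txt:
        return "stayed away"
    return "fetched page"


def edge_outcome(crawler: Crawler, action: Action) -> int:
    """Enforced model: the edge answers before the origin sees the request."""
    if action is Action.BLOCK:
        return 403  # refused regardless of the crawler's manners
    if action is Action.CHARGE and not crawler.will_pay:
        return 402  # no payment, no content
    return 200


if __name__ == "__main__":
    rude = Crawler("rude-bot", honors_robots_txt=False)
    search = Crawler("search-bot", honors_robots_txt=True)
    print(robots_txt_outcome(rude, disallowed=True))
    print(edge_outcome(rude, Action.BLOCK))
    print(edge_outcome(search, Action.BLOCK))
```

The last call is the accidental case: a well-behaved retrieval crawler gets the same refusal as the rude one, because the edge rule does not know the block was unintended.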
Block, Charge, or Allow: Working the Decision
With the categories clear, the decision becomes a small matrix rather than a vague worry. You have two kinds of crawler, training and retrieval, and three actions, allow, charge, or block. The retrieval column is the easy one for almost every brand reading this. If you want to be cited and found in AI search, you allow retrieval crawlers. Charging or blocking them is choosing to opt out of the channel you are presumably trying to win, which only makes sense in rare cases where you genuinely do not want AI surfacing your content at all.
The training column is where the real judgment lives, and it is legitimately contested. A large publisher with a deep, valuable archive may reasonably choose to charge for training access or block it, treating its content as an asset that AI companies should pay to learn from. A growth-stage brand whose goal is reach and authority usually has the opposite incentive, because being part of the models that shape AI answers can help rather than hurt. There is no single correct answer here, but there is a single correct process: decide training and retrieval separately, on their own merits.
The default that fits most brands. Allow retrieval so you stay citable, then decide training access on content-rights grounds. Blanket-blocking everything is almost never the right call for a brand chasing AI visibility.
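The matrix is small enough to write down and check mechanically. The sketch below is one hypothetical way to do that: `review_policy` and the policy dictionary shape are our own illustration, not a Cloudflare configuration format, and the warnings simply restate the reasoning above.

```python
ALLOWED_ACTIONS = {"allow", "charge", "block"}


def review_policy(policy: dict[str, str],
                  wants_ai_visibility: bool = True) -> list[str]:
    """Return warnings for a policy like {"retrieval": ..., "training": ...}."""
    warnings = []
    for purpose in ("retrieval", "training"):
        action = policy.get(purpose)
        if action not in ALLOWED_ACTIONS:
            warnings.append(f"{purpose}: no explicit decision (found {action!r})")
    if wants_ai_visibility and policy.get("retrieval") in {"charge", "block"}:
        warnings.append(
            "retrieval is not open: pages cannot be cited in live AI answers"
        )
    if policy.get("retrieval") == policy.get("training") == "block":
        warnings.append(
            "looks like a blanket block: confirm both columns were decided separately"
        )
    return warnings


if __name__ == "__main__":
    for candidate in (
        {"retrieval": "allow", "training": "block"},
        {"retrieval": "block", "training": "block"},
    ):
        print(candidate, "->", review_policy(candidate) or "no warnings")
```

A brand that genuinely does not want AI systems surfacing its content can pass `wants_ai_visibility=False` to silence the retrieval warning, which keeps the rare opt-out case an explicit choice rather than an accident.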
For brands whose entire strategy is built around AI citations, this decision is foundational rather than peripheral. Everything in our work on getting cited across ChatGPT, Claude, and Perplexity assumes the retrieval crawlers can actually reach your pages. The best content and schema in the world earn nothing if the bot that would have quoted them is stopped at the edge. Access is the precondition for everything else, which is why it belongs at the top of the checklist, not the bottom.
Pairing Access Policy With an llms.txt File
Deciding which crawlers to allow is the access layer. Telling the ones you allow how to treat your site is a separate, complementary step, and that is what an llms.txt file is for. Where robots.txt governs whether a bot may crawl, an llms.txt file signals how AI systems should understand and use your content: which pages are the canonical sources, how your site is organized, and what you most want surfaced. It is a guide for the crawlers you have chosen to welcome, not a gate.
The two work together. Allowing a retrieval crawler gets your content in the door. An llms.txt file helps that crawler make sense of what it finds and points it at the material you want represented in AI answers. On a large or sprawling site, that guidance is the difference between an AI system citing your strongest, most current page and citing something stale or tangential. You can generate a starting file in seconds with our llms.txt generator and then refine it as your content evolves.
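For a sense of what such a file contains, the sketch below builds one from a small site description. It follows the layout of the public llms.txt proposal as we understand it (a title, a one-line summary, then sections of annotated links). The helper name `build_llms_txt`, the site name, and every URL are hypothetical, and whether a given AI system reads the file is not something this code can guarantee.

```python
def build_llms_txt(site_name: str, summary: str,
                   sections: dict[str, list[tuple[str, str, str]]]) -> str:
    """Render an llms.txt-style document.

    `sections` maps a heading to a list of (title, url, note) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # Assumption: invented site and pages, for illustration only.
    print(build_llms_txt(
        "Example Co",
        "Example Co sells scheduling software for small clinics.",
        {
            "Guides": [
                ("Pricing", "https://example.com/pricing",
                 "Current plans and what each includes"),
                ("Setup guide", "https://example.com/docs/setup",
                 "Canonical onboarding steps"),
            ],
        },
    ))
```

Listing only the pages you most want represented, with a short note on each, is what steers a crawler toward your strongest and most current material.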
Structured data sits alongside this as the other half of being legible to machines. Clean, valid schema tells AI systems what your content claims in a form they can parse without guessing, which makes accurate citation more likely. Access, llms.txt, and schema form a stack: the first lets the crawler in, the second tells it how to read your site, and the third tells it what each page actually means. Our guide to structured data for AI search and citations covers the schema layer in depth.
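As a small illustration of the schema layer, the sketch below emits a JSON-LD block for an article using schema.org's `Article` type. The helper names `article_schema` and `as_script_tag` are hypothetical, the field values are invented, and the property set is a minimal assumption rather than a complete or recommended markup for every page.

```python
import json


def article_schema(headline: str, url: str, publisher: str,
                   date_published: str, date_modified: str) -> dict:
    """Build a minimal schema.org Article description as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "mainEntityOfPage": url,
        "author": {"@type": "Organization", "name": publisher},
        "datePublished": date_published,
        "dateModified": date_modified,
    }


def as_script_tag(schema: dict) -> str:
    """Wrap the schema in the script tag used to embed JSON-LD in a page."""
    body = json.dumps(schema, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    # Assumption: placeholder values for illustration.
    schema = article_schema(
        "How clinics cut no-show rates",
        "https://example.com/blog/no-show-rates",
        "Example Co",
        "2025-08-01",
        "2025-09-15",
    )
    print(as_script_tag(schema))
```

Keeping `dateModified` accurate matters for the same reason the llms.txt guidance does: it helps a system prefer your current page over a stale one.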
Who Should Actually Charge for Crawling
Pay per Crawl makes charging possible, but possible is not the same as advisable for most sites. The economics only work in your favor when your content has scarcity value that AI companies have a reason to pay for. A major news organization with a deep, current, hard-to-replicate archive is in a real negotiating position. A typical B2B brand whose blog covers topics also covered by a hundred competitors is not, because an AI company can simply route around a paywall and pull the same information from a source that did not charge.
The risk of charging without leverage is straightforward. If you put a price on retrieval and the AI systems decide your content is not worth it, they pull from someone else and you lose the citation. For most brands, a citation is worth more than a micropayment would be, because the citation drives awareness, authority, and downstream traffic that compounds. Charging makes sense when you are protecting something genuinely scarce, and it backfires when you are taxing access to content that is freely available elsewhere.
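The trade-off can be put into rough arithmetic. The model below is a deliberately crude sketch: every number is invented for illustration, the function names are hypothetical, and it assumes that the share of crawls an AI system pays for (`pay_rate`) is also the share of citations you keep. Real behavior will differ, so treat it as a way to structure the question rather than a forecast.

```python
def monthly_value_if_free(citations: float, visits_per_citation: float,
                          value_per_visit: float) -> float:
    """Value of citations when retrieval is open to everyone."""
    return citations * visits_per_citation * value_per_visit


def monthly_value_if_charging(crawls: float, fee: float, pay_rate: float,
                              citations: float, visits_per_citation: float,
                              value_per_visit: float) -> float:
    """Fees collected plus the citations that survive the paywall.

    Assumption: crawlers that decline to pay source the answer elsewhere,
    so only `pay_rate` of the original citations remain.
    """
    fees = crawls * pay_rate * fee
    retained = citations * pay_rate
    return fees + retained * visits_per_citation * value_per_visit


if __name__ == "__main__":
    # Invented inputs: 200 citations a month, half a visit per citation,
    # each visit worth 2.00, 5,000 crawl attempts at a 0.01 fee.
    free = monthly_value_if_free(200, 0.5, 2.0)
    commodity = monthly_value_if_charging(5000, 0.01, 0.1, 200, 0.5, 2.0)
    scarce = monthly_value_if_charging(5000, 0.01, 0.9, 200, 0.5, 2.0)
    print(f"free: {free:.2f}  commodity content: {commodity:.2f}  scarce content: {scarce:.2f}")
```

With these made-up inputs, charging for commodity content comes out well below leaving retrieval free, while charging for scarce content comes out slightly above it. The result turns almost entirely on how often the AI system pays rather than routes around you, which is the leverage question in numeric form.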
Training access is where charging is more defensible for a wider range of publishers, because the calculus is about rights and one-time value rather than ongoing visibility. You can block or charge for training while keeping retrieval free, capturing whatever value your archive holds for model-building without sacrificing your presence in live answers. That split (monetize or protect training, keep retrieval open) is the configuration that lets a content owner have it both ways. It only works because the disclosure requirement lets you tell the two crawl purposes apart in the first place.
Setting a Policy You Can Live With
A workable AI crawler policy starts with an audit, not an opinion. Find out which AI bots currently reach your origin and which are blocked, whether by a Cloudflare default, an old robots.txt rule, or a firewall setting nobody remembers adding. You cannot make a deliberate choice until you know what your current accidental choice is. For many brands this first step alone surfaces that their retrieval access was closed by a default they never saw.
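One way to run that audit is to tally AI bot requests by response status from your logs. The sketch below assumes you can export logs in the common "combined" access-log format. The sample lines, the token list, and the helper names `tally_ai_hits` and `always_refused` are all invented for illustration.

```python
import re
from collections import Counter, defaultdict

# Matches: "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'"\w+ \S+ \S+" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Assumption: illustrative token list, not a complete registry.
AI_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot"]

SAMPLE_LOG = [
    '203.0.113.5 - - [01/Aug/2025:10:00:00 +0000] "GET /pricing HTTP/1.1" 403 512 "-" '
    '"Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot)"',
    '203.0.113.9 - - [01/Aug/2025:10:00:05 +0000] "GET /blog HTTP/1.1" 200 9120 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"',
    '198.51.100.7 - - [01/Aug/2025:10:01:00 +0000] "GET / HTTP/1.1" 200 8421 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]


def tally_ai_hits(lines: list[str]) -> dict[str, dict[int, int]]:
    """Count responses per AI bot token, grouped by HTTP status."""
    tally: dict[str, Counter] = defaultdict(Counter)
    for line in lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue
        ua = match.group("ua").lower()
        for token in AI_TOKENS:
            if token.lower() in ua:
                tally[token][int(match.group("status"))] += 1
                break
    return {token: dict(counts) for token, counts in tally.items()}


def always_refused(tally: dict[str, dict[int, int]]) -> list[str]:
    """Bots that appear in the logs but never received a 2xx response."""
    return sorted(
        token for token, counts in tally.items()
        if not any(200 <= status < 300 for status in counts)
    )


if __name__ == "__main__":
    hits = tally_ai_hits(SAMPLE_LOG)
    print(hits)
    print("never served:", always_refused(hits))
```

Requests stopped at the edge never reach your origin, so run this against edge logs where you have them. In origin logs, a retrieval bot that is missing from the tally altogether is as much of a signal as one that is always refused.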
From there, write the policy down in plain terms: retrieval crawlers allowed, training access set to whatever you decided and why. Documenting the reasoning matters because the next person who touches your infrastructure, or the next platform migration, will otherwise reset everything to a default and undo your work silently. A one-paragraph policy in your engineering and marketing runbooks is cheap insurance against re-inheriting the closed default six months from now. Pair it with the llms.txt and schema work so the crawlers you allow actually understand your site.
Finally, monitor the outcome, because access is a means and citations are the end. Check periodically whether your brand shows up when AI assistants answer questions in your category, and treat a persistent absence as a signal to re-audit your crawler settings before assuming it is a content problem. This is the kind of ongoing work our AIO optimization service is built to run, from the initial access audit through the structured-data and llms.txt layers and into measurement. If you would rather get the access decision right once, with expert hands, that is the place to start. You can open a conversation about your situation through our optimization consultation.
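The periodic check can be reduced to a simple measurement. The sketch below assumes you already collect, by hand or with a tool, the URLs that assistants cite for a fixed set of category questions. The record shape, the default threshold, and the helper names `cites_domain`, `citation_rate`, and `should_reaudit` are hypothetical choices for illustration.

```python
from urllib.parse import urlparse


def cites_domain(cited_urls: list[str], domain: str) -> bool:
    """True if any cited URL is on `domain` or one of its subdomains."""
    domain = domain.lower()
    for url in cited_urls:
        host = (urlparse(url).hostname or "").lower()
        if host == domain or host.endswith("." + domain):
            return True
    return False


def citation_rate(answers: list[dict], domain: str) -> float:
    """Share of sampled answers that cite the domain at least once."""
    if not answers:
        return 0.0
    hits = sum(cites_domain(answer["cited_urls"], domain) for answer in answers)
    return hits / len(answers)


def should_reaudit(rate_history: list[float], threshold: float = 0.05,
                   runs: int = 3) -> bool:
    """Flag a persistent absence: the last `runs` checks all below threshold."""
    recent = rate_history[-runs:]
    return len(recent) == runs and all(rate < threshold for rate in recent)


if __name__ == "__main__":
    # Assumption: invented sample of two answers.
    sample = [
        {"question": "best clinic scheduling software",
         "cited_urls": ["https://www.example.com/guide"]},
        {"question": "how to reduce no-shows",
         "cited_urls": ["https://other.test/page"]},
    ]
    print(citation_rate(sample, "example.com"))
    print(should_reaudit([0.2, 0.0, 0.0, 0.0]))
```

Requiring several consecutive low readings before flagging keeps one quiet week from triggering an audit, while still catching the sustained absence that points to an access problem rather than a content problem.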
Frequently Asked Questions
Does blocking AI crawlers hurt my search visibility?
It depends entirely on which crawlers you block. Blocking training crawlers, the ones that fetch your pages to train AI models, does not remove you from live AI answers. Blocking retrieval crawlers, the ones that fetch your pages to build a live response or citation, does remove you from those answers. A blanket block hits both, so a brand that wants to be cited in AI search can make itself invisible without realizing it. The safe default for a visibility-focused brand is to allow retrieval and decide separately on training.
What is Cloudflare Pay per Crawl?
Pay per Crawl is a Cloudflare marketplace that gives publishers control over AI crawling and a way to monetize it. For each AI crawler, a site owner can allow access for free, charge a per-crawl fee, or block access entirely. Cloudflare sits in the middle, handling payment processing and distributing revenue to publishers. It is currently in private beta. We cover the broader machine-traffic shift in our guide to AI crawler optimization for the machine-majority web.
Are new Cloudflare sites blocking AI crawlers automatically?
Yes. As of July 2025, new websites set up through Cloudflare block all AI crawlers by default, and site owners must actively grant specific bots permission to access content. If you launched on Cloudflare and never reviewed your bot settings, your pages may already be closed to the retrieval crawlers that feed AI answers. This is the single most common way brands go accidentally invisible in AI search.
Why is network-level control stronger than robots.txt?
robots.txt is a request, not an enforcement mechanism, and a bot can ignore it. Network-level controls like Cloudflare's operate at the edge before a request reaches your origin, so a blocked bot is actually stopped rather than politely asked to leave. That strength cuts both ways. It makes monetization and blocking real, and it makes an accidental block real too, because there is no honor system to fall back on.
Do AI companies have to say what they are crawling for?
Under the Cloudflare model, AI companies must disclose whether they are crawling content for training purposes, for live search responses, or for other uses. That disclosure is what makes a thoughtful policy possible. You can allow the retrieval crawlers that drive citations while charging or blocking the training crawlers that build models, instead of treating every AI bot as one undifferentiated category.
Should most brands block or allow AI crawlers?
For most brands that want to be cited and found in AI search, the answer is allow retrieval, not block. The retrieval crawlers are what put your content into AI Overviews, AI Mode, and conversational answers, and blocking them removes the visibility you are trying to build. Charging or blocking can make sense for training access, where large publishers protect or monetize their archives, but that is a separate decision from keeping live answers open.