Let's Know Things podcast

Pay Per Crawl


This week we talk about crawling, scraping, and DDoS attacks.

We also discuss Cloudflare, the AI gold rush, and automated robots.

Recommended Book: Annie Bot by Sierra Greer

Transcript

Alongside the many, and at times quite significant political happenings, the many, and at times quite significant military conflicts, and the many, at times quite significant technological breakthroughs—medical and otherwise—flooding the news these days, there’s also a whole lot happening in the world of AI, in part because this facet of the tech sector is booming, and in part because while still unproven in many spaces, and still outright flubbing in others, this category of technology is already having a massive impact on pretty much everything, in some cases for the better, in some for the worse, and in some for better and worse, depending on your perspective.

Dis- and misinformation, for instance, is a bajillion times easier to create, distribute, and amplify, and the fake images and videos and audio being shared, alongside all the text that seems to come from legit people but may in fact be the product of AI run by malicious actors somewhere, are increasingly convincing and difficult to distinguish from the real-deal versions of the same.

There’s also a lot more of it, and the ability to very rapidly create pretty convincing stuff, and to very rapidly flood all available communication channels with that stuff, is fundamental to AI’s impact in many spaces, not just the world of propaganda and misinformation. At times quantity has a quality all of its own, and that very much seems to be the case for AI-generated content as a whole.

Other AI- and AI-adjacent tools are being used by corporations to improve efficiency, in some cases helping automated systems like warehouse robots assist humans in sorting and packaging and otherwise getting stuff ready to be shipped. That’s the case with Amazon, which is nearly at the point of having more robots in its various facilities than human beings: Amazon robots currently assist with about 75% of the company’s global deliveries, and a lot of the menial, repetitive tasks human workers would previously have done are now accomplished by robotics systems the company has introduced into its shipping chain.

Of course, not everyone is thrilled about this turn of events: while it’s arguably wonderful that robots are being subbed-in for human workers who would previously have had to engage in the sorts of repetitive, physical tasks that can lead to chronic physical issues, in many cases this seems to be a positive side-benefit of a larger effort to phase-out workers whenever possible, saving the company money over time by employing fewer people.

If you can employ 100 people using robots instead of 1000 people sans-robots, depending on the cost of operation for those robots, that might save you money because each person, augmented by the efforts of the robots, will be able to do a lot more work and thus provide more value for the company. Sometimes this means those remaining employees will be paid more, because they’ll be doing more highly skilled labor, working with those bots, but not always.

For a long while, CEOs danced around this component of the shift, not wanting to spook their existing workforces or lose employees before the new robot foundation was in place. But it’s increasingly something they’re saying out loud, on investor calls and in the press, because making these sorts of moves is considered good for a company’s outlook: it signals they’re being brave and looking toward a future in which fewer human employees will be necessary, which implies their stock might be currently undervalued, because the potential savings are substantial, at least in theory.

And it is a lot of theory at this point: there’s good reason to believe that theory is true, at least to some degree, but we’re at the very beginning phases of this seeming transition, and many companies that jumped too quickly and fired too many people found themselves having to hire them back, in some cases at great expense, because their production faltered under the weight of inferior automated, often AI-driven alternatives.

Many of these tools simply aren’t as reliable as human employees yet. And while they will almost certainly continue to become more powerful and capable—a recent estimate suggested that the current wave of large-language-model-based AI systems, for instance, are doubling in power every 7 months or so, which is wild—speculations about what that will mean, and whether that trend can continue, vary substantially, depending on who you talk to.
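For a sense of what that doubling rate would imply, here’s the compounding math (a quick sketch in Python, taking the 7-month figure at face value):

```python
# If capability doubles every 7 months, then after t months it has
# multiplied by 2 ** (t / 7).
for months in (12, 24, 36):
    factor = 2 ** (months / 7)
    print(f"{months} months -> roughly {factor:.0f}x")

# 12 months -> roughly 3x
# 24 months -> roughly 11x
# 36 months -> roughly 35x
```

Numbers like that are exactly why the speculation varies so much: a trend implying systems 35 times more capable within three years either continues, or it doesn’t.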

Something we can say with relative certainty right now, though, is that most of these models, the LLM ones, at least, not the robot-driving ones, were built using content that was gathered and used in a manner that currently exists in a legal gray area: it was scraped and amalgamated by these systems so that they could be trained on a corpus of just a silly volume of human output, much of that output copyrighted or otherwise theoretically not usable for this purpose.

What I’d like to talk about today is a new approach to dealing with the potentially illegal scraping of copyrighted information by and for these systems, and a proposed new pricing scheme that could allow the creators of the content being scraped in this way to make some money from it.

Web scraping refers to the large-scale crawling of websites and collection of data from those websites.

There are a number of methods for achieving this, including just manually visiting a bunch of websites and copying and pasting all the content from those sites into a file on your computer. But the large-scale, automated version of that is something many companies, including entities like Google, do, for various purposes: Google crawls the web to map it, basically, and then applies all sorts of algorithms and filters in order to build its search results. Other entities crawl the web to gather data, to figure out connections between different sorts of sites, and/or to price the ads they sell on their own network of sites, or the products they sell and would like to offer at a slightly lower price than their competition.
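As a rough illustration of the automated version, here’s what a minimal scraper might look like in Python (a sketch, assuming the third-party requests and beautifulsoup4 packages are installed; example.com stands in for whatever site is being crawled):

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page, identifying ourselves via the User-Agent header.
response = requests.get(
    "https://example.com",
    headers={"User-Agent": "ExampleCrawler/1.0 (demo)"},
    timeout=10,
)
response.raise_for_status()

# Parse the HTML, then pull out the readable text and any outbound
# links that could be queued up and crawled next.
soup = BeautifulSoup(response.text, "html.parser")
page_text = soup.get_text(separator=" ", strip=True)
links = [a["href"] for a in soup.find_all("a", href=True)]

print(page_text[:200])
print(links)
```

A real crawler wraps that fetch-parse-extract cycle in queueing, deduplication, and politeness delays, but the core loop really is about that simple, which is part of why scraping is so widespread.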

Web scraping can be done neutrally, then: your website is scraped by Google so it can add your site to its search results, with the data it collects telling its algorithms where you should sit in those results based on keywords, who links to your site, and other such things. But it can also be done maliciously: maybe someone wants to duplicate your website and use the copy to get unsuspecting victims to install malware on their devices. Or maybe someone wants to steal your output: your writings, your flight pricing data, and so on.

If you don’t want these automated web-scrapers to use your data, or to access some portion or all of your site, you can put a file called robots.txt in your site’s root directory, and the honorable scrapers will respect that request: the Googles of the world, for instance, have built their scrapers to look for a robots.txt file and read its contents before mapping out your website’s structure and soaking up your content to decide where to put you in their search results.

Not all scrapers respect this request: the robots.txt standard relies on voluntary compliance. There’s nothing forcing any scraper, or the folks running these scrapers, to look for or honor these files and what they contain.
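A robots.txt file is just plain text sitting at the root of a site; a minimal one might look like this:

```
User-agent: *
Disallow: /private/
```

And an honorable crawler can check it with nothing more than Python’s standard library (a small sketch; example.com again stands in for the target site):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt, then ask whether a given
# user-agent may crawl a given path. Nothing enforces the answer;
# honoring it is entirely up to the crawler.
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# Against the sample file above, the first check would print False
# and the second True.
print(parser.can_fetch("ExampleCrawler", "https://example.com/private/page"))
print(parser.can_fetch("ExampleCrawler", "https://example.com/blog/post"))
```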

That said, we’ve reached a moment at which many scrapers are not just looking for keywords and linkbacks, but also looking to grab basically everything on a website so that the folks running the scrapers can ingest those images and that writing and anything else that’s legible to their software into the AI systems they’re training.

As a result, many of these systems were trained on content that is copyrighted, that’s owned by the folks who wrote or designed or photographed it, and that’s created a legal quagmire that court systems around the world are still muddling through.

There have been calls to update the robots.txt standard to make clear what sorts of content can be scraped for AI-training purposes and what cannot, but the non-compulsory, not-legally-backed nature of such requests seems to make robots.txt an insufficient vehicle for this sort of endeavor. The land-grab, gold-rush nature of the AI industry right now suggests that most companies would not honor these requests: it’s generally understood that they’re all trying to produce the most powerful AI possible as fast as possible, hoping to be at or near the top before the inevitable shakeout moment, at which point most of these companies will go bankrupt or otherwise cease to exist.
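Some AI companies do publish user-agent tokens that site owners can target; a robots.txt aimed at AI-training crawlers might look like this (GPTBot, CCBot, and Google-Extended are tokens documented by OpenAI, Common Crawl, and Google, respectively):

```
# Ask known AI-training crawlers to stay out entirely
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everyone else, including ordinary search crawlers, may proceed
User-agent: *
Allow: /
```

But again, nothing happens to a scraper that ignores this file, and that enforcement gap is exactly what Cloudflare is now trying to fill.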

That’s important context for understanding a recent announcement from internet infrastructure company Cloudflare, which said it would be introducing something along the lines of an enforceable robots.txt file for its customers, called pay per crawl.

Cloudflare is a US-based company that provides all sorts of services, from domain registration to firewalls, but it’s probably best known for its web security services, including its ability to block DDoS, or distributed denial of service, attacks, in which a hacker or other malicious actor lashes together a bunch of devices they’ve compromised, through malware or otherwise, into what’s called a botnet, and uses those devices to send a flood of traffic at a website or other web-based entity all at once.

This can result in so much traffic, think millions of requests or even billions of packets per second (a recent attack that Cloudflare successfully mitigated hurled 7.3 terabits per second at one of its customers, for instance), that the targeted website becomes inaccessible, sometimes for long periods of time.

So Cloudflare provides a service where they’re basically like a firewall between a website and the web, and when something like a DDoS attack happens, Cloudflare’s services go into action and the targeted website stays up, rather than being taken down.
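The core idea is easy to caricature in a few lines of code. Here’s a toy per-client rate limiter of the general sort a protective middle layer might apply (purely a conceptual sketch; Cloudflare’s actual mitigations involve anycast routing, traffic fingerprinting, and enormous global capacity, not a Python dictionary):

```python
import time
from collections import defaultdict

RATE = 10   # tokens refilled per second, per client
BURST = 20  # maximum bucket size

# Token-bucket rate limiting: each client IP gets a budget of requests
# that refills over time; traffic beyond the budget is rejected.
buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(client_ip: str) -> bool:
    bucket = buckets[client_ip]
    now = time.monotonic()
    # Refill tokens for the elapsed time, capped at the burst size.
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1:
        bucket["tokens"] -= 1
        return True   # pass the request through to the origin site
    return False      # drop it; the origin never sees the flood
```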

As a result of this and similarly useful offerings, Cloudflare security services are used by more than 19% of all websites on the internet, which is an absolutely stunning figure considering how big the web is these days—there are an estimated 1.12 billion websites, around 200 million of which are estimated to be active as of Q1 2025.

All that said, Cloudflare recently announced a new service, called pay per crawl, that would use that same general principle of putting themselves between the customer and the web to actively block AI web scrapers that want to scrape the customer’s content, unless the customer gives permission for them to do so.

Customers can turn this service on or off, but they can also set a price for scraping their content—a paywall for automated web-scrapers and the AI companies running them, basically.
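Cloudflare has said the mechanics are built on HTTP status code 402, “Payment Required,” a code that has sat mostly unused in the HTTP spec since the early days of the web. A well-behaved crawler’s side of that exchange might look roughly like this (a sketch based on the headers described in Cloudflare’s announcement, crawler-max-price, crawler-price, and crawler-charged; the requests package is assumed, and the exact semantics may differ from the shipped product):

```python
import requests

# A crawler declares the most it will pay per request. If the site's
# quoted price exceeds that budget, the price comes back on a 402
# response and the crawler walks away.
MAX_PRICE = "0.01"  # illustrative budget: one cent per crawl

response = requests.get(
    "https://example.com/article",
    headers={
        "User-Agent": "ExampleAIBot/1.0",
        "crawler-max-price": MAX_PRICE,  # header name per the announcement
    },
    timeout=10,
)

if response.status_code == 402:
    quoted = response.headers.get("crawler-price")
    print(f"Site wants {quoted} per crawl; over budget, skipping.")
elif response.ok:
    charged = response.headers.get("crawler-charged", "nothing")
    print(f"Crawled successfully; charged: {charged}")
```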

The nature of these payments is currently up in the air, and it could be that content creators and owners, from an individual blogger to the New York Times, only earn something like a penny per crawl, which could add up to a lot of money for the Times but only be a small pile of pennies for the blogger.

It could also be that AI companies don’t play ball with Cloudflare and instead they do what many tech analysts expect them to do: they come up with ways to get around Cloudflare’s wall, and then Cloudflare makes the wall taller, the tech companies build taller ladders, and that process just spirals ad infinitum.

This isn’t a new idea, and the monetization aspect of it is predicated on some early web conceptions of how micropayments might work.

It’s also not entirely clear whether the business model would make sense for anyone. The AI companies have long complained they would go out of business if they had to pay anything at all for the content they’re using to train their AI models. Big publishers like the New York Times face possible extinction if everything they pay a lot of money to produce is just grabbed by AI as soon as it goes live, those AI companies making money from content they paid nothing for. And individual makers-of-things face similar issues as the Times, but without the leverage to make deals with individual AI companies, as the Times has.

It also seems that AI chatbots are beginning to replace traditional search engines, so it’s possible that anyone who uses this sort of wall will be excluded from the search of the future. Those whose content is gobbled up and used without payment will be increasingly visible, their ideas and products and so on more likely to pop up in AI-based search results, while those who put up a wall may be less visible; so there’s a big potential trade-off there for anyone who decides to use this kind of paywall, especially if all the big AI companies don’t buy into it.

Like everything related to AI right now, then, this is a wild west space, and it’s not at all clear which concepts will win out and become the new default, and which will disappear almost as soon as they’re proposed.

It’s also not clear if and when the larger economic forces underpinning the AI gold rush will collapse, leaving just a few big players standing and the rest imploding, Dotcom Bubble style. That could, in turn, completely undo any defaults established in the lead-up to that moment, making some monetization approaches no longer feasible while rendering others, including possibly paywalls and micropayments, suddenly more thinkable and even desirable.

Show Notes

https://www.wired.com/story/pro-russia-disinformation-campaign-free-ai-tools/

https://www.wsj.com/tech/amazon-warehouse-robots-automation-942b814f

https://www.wsj.com/tech/ai/ai-white-collar-job-loss-b9856259

https://w3techs.com/technologies/details/cn-cloudflare

https://www.demandsage.com/website-statistics/

https://blog.cloudflare.com/defending-the-internet-how-cloudflare-blocked-a-monumental-7-3-tbps-ddos/

https://en.wikipedia.org/wiki/Web_scraping

https://en.wikipedia.org/wiki/Robots.txt

https://developers.cloudflare.com/ai-audit/features/pay-per-crawl/use-pay-per-crawl-as-site-owner/set-a-pay-per-crawl-price/

https://techcrunch.com/2025/07/01/cloudflare-launches-a-marketplace-that-lets-websites-charge-ai-bots-for-scraping/

https://www.nytimes.com/2025/07/01/technology/cloudflare-ai-data.html

https://creativecommons.org/2025/06/25/introducing-cc-signals-a-new-social-contract-for-the-age-of-ai/

https://arstechnica.com/tech-policy/2025/07/pay-up-or-stop-scraping-cloudflare-program-charges-bots-for-each-crawl/

https://www.cloudflare.com/paypercrawl-signup/

https://www.cloudflare.com/press-releases/2025/cloudflare-just-changed-how-ai-crawlers-scrape-the-internet-at-large/

https://digitalwonderlab.com/blog/the-ai-paywall-era-a-turning-point-for-publishers-or-just-another-cat-and-mouse-game



This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit letsknowthings.substack.com/subscribe
