RSS co-creator launches new protocol for AI data licensing



In the wake of Anthropic’s $1.5 billion copyright settlement, the AI industry is coming to terms with its training data problem. There are as many as 40 other pending cases that seek damages for unlicensed data — including one that takes Midjourney to court for creating images of Superman.

Without some kind of licensing system, AI companies could face an avalanche of copyright lawsuits that some worry will set the industry back permanently.

Now, a group of technologists and web publishers has launched a system that would enable data licensing at massive scale, provided AI companies take them up on it. Called Really Simple Licensing (RSL), the system is already backed by major web publishers like Reddit, Quora and Yahoo. The question now is whether that momentum will be enough to bring major AI labs to the bargaining table.

According to RSL co-founder Eckart Walther, who also co-created the RSS standard, the goal was to create a training-data licensing system that could scale across the internet. “We need to have machine-readable licensing agreements for the internet,” Walther told TechCrunch. “That’s really what RSL solves.”

For years, groups like the Dataset Providers Alliance have been pushing for clearer collection practices, but RSL is the first attempt at a technical and legal infrastructure that could make it work in practice. On the technical side, the RSL Protocol lays out specific licensing terms a publisher can set for their content, whether that means AI companies need a custom license or to adopt Creative Commons provisions. Participating websites will include the terms as part of their “robots.txt” file in a prearranged format, making it straightforward to identify which data falls under which terms.
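To make the robots.txt mechanism concrete, here is a minimal Python sketch of how a crawler might discover a site's licensing terms. It rests on assumptions the article does not confirm: that the terms are referenced by a `License:` line pointing at a separate terms file, and that the file lives at a URL like `https://example.com/license.xml`. The directive name, the URL and the `find_license_urls` helper are all hypothetical, and the actual RSL format may differ.

```python
from urllib.parse import urlparse


def find_license_urls(robots_txt: str, directive: str = "license") -> list[str]:
    """Return the URLs named by every `<directive>: <url>` line in a robots.txt body.

    The directive name is an assumption for illustration, not taken from the article.
    """
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        # Split only on the first colon so the URL's own "://" survives.
        key, value = line.split(":", 1)
        if key.strip().lower() != directive:
            continue
        value = value.strip()
        parsed = urlparse(value)
        # Keep only absolute http(s) URLs.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(value)
    return urls


# Hypothetical robots.txt body; the License line is an assumed format.
EXAMPLE = """\
User-agent: *
Allow: /
License: https://example.com/license.xml  # hypothetical terms file
"""

print(find_license_urls(EXAMPLE))
```

A crawler following this pattern would fetch the referenced file before ingesting any pages and skip or pay for content according to the terms it finds there. Because the sketch treats `#` as a comment marker, a URL containing a fragment would be cut short.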

On the legal side, the RSL team has established a collective licensing organization, the RSL Collective, that can negotiate terms and collect royalties, similar to ASCAP for musicians or MPLC for films. As in music and film, the goal is to give licensees a single point of contact for paying royalties, and provide rightsholders a way to set terms with dozens of potential licensees at once.

A host of web publishers have already joined the collective, including Yahoo, Reddit, Medium, O’Reilly Media, Ziff Davis (owner of Mashable and CNET), Internet Brands (owner of WebMD), People Inc. and The Daily Beast. Others, like Fastly, Quora and Adweek, are supporting the standard without joining the collective.

Notably, the RSL Collective includes some publishers that already have licensing deals, among them Reddit, which receives an estimated $60 million a year from Google for the use of its data in training. There’s nothing stopping companies from cutting their own deals within the RSL system, just as Taylor Swift can set special terms for licensing while still collecting royalties through ASCAP. But for publishers too small to strike their own deals, RSL’s collective terms are likely to be the only option.

But while it’s easy enough to determine when a song has been played, AI models pose unique challenges when it comes to figuring out when royalties are due for a specific piece of training data. The issue is simplest for a product like Google’s AI Search Abstracts, which draw data from the web in real time and maintain strict attribution for each fact.

But if training isn’t logged when it occurs, it can be nearly impossible to confirm that a given document was ingested into an LLM. It’s particularly challenging if publishers ask to be paid per inference rather than receiving a blanket fee, an option offered by one of the stock RSL licenses.

Still, RSL’s creators believe AI companies will be able to manage the difficulty. “Some of the licensing agreements they’ve already done have required them to be able to report on it, so it’s possible,” says Doug Leeds, a co-founder of RSL and former CEO of IAC Publishing. “It doesn’t have to be perfect. It just has to be good enough to get people paid.”

The bigger question is whether AI companies will embrace the system. As the success of companies like Scale AI and Mercor shows, frontier labs have no problem paying for data, but the web has traditionally been seen as a source of cheap, low-quality data. With datasets like the Common Crawl already available, it may be a challenge to extract royalties for something labs are used to getting for free. And as the recent dustup between Cloudflare and Perplexity shows, it’s not straightforward to tell the difference between web scraping and machine-enhanced browsing.

When I put the question to Leeds, he pointed to recent comments from AI leaders calling for a system like RSL, most notably from Sundar Pichai at last year’s DealBook Summit. Whether the calls for a licensing system are earnest or not, the RSL team plans to hold them to it. “They have said outwardly to everyone, something like this needs to exist,” Leeds told me. “We need a protocol. We need a system.”

Now, they may get one.



