OpenAI has never revealed exactly which data it used to train Sora, its video-generating AI. But from the looks of it, at least some of the data might’ve come from Twitch streams and walkthroughs of games.
Sora launched on Monday, and I’ve been playing around with it for a bit (to the extent the capacity issues will allow). From a text prompt or image, Sora can generate up to 20-second-long videos in a range of aspect ratios and resolutions.
When OpenAI first revealed Sora in February, it alluded to the fact that it trained the model on Minecraft videos. So, I wondered, what other video game playthroughs might be lurking in the training set?
Quite a few, it seems.
Sora can generate a video of what’s essentially a Super Mario Bros. clone (if a glitchy one):
It can create gameplay footage of a first-person shooter that looks inspired by Call of Duty and Counter-Strike:
And it can spit out a clip showing an arcade fighter in the style of a ’90s Teenage Mutant Ninja Turtles game:
Sora also appears to have an understanding of what a Twitch stream should look like — implying that it’s seen a few. Check out the screenshot below, which gets the broad strokes right:
Another noteworthy thing about the screenshot: It features the likeness of popular Twitch streamer Raúl Álvarez Genes, who goes by the name Auronplay — down to the tattoo on Genes’ left forearm.
Auronplay isn’t the only Twitch streamer Sora seems to “know.” It generated a video of a character similar in appearance (with some artistic liberties) to Imane Anys, better known as Pokimane.
Granted, I had to get creative with some of the prompts (e.g. “italian plumber game”). OpenAI has implemented filtering to try to prevent Sora from generating clips depicting trademarked characters. Typing something like “Mortal Kombat 1 gameplay,” for example, won’t yield anything resembling the title.
But my tests suggest that game content may have found its way into Sora’s training data.
OpenAI has been cagey about where it gets training data from. In an interview with The Wall Street Journal in March, OpenAI’s then-CTO, Mira Murati, wouldn’t outright deny that Sora was trained on YouTube, Instagram, and Facebook content. And in the tech specs for Sora, OpenAI acknowledged it used “publicly available” data, along with licensed data from stock media libraries like Shutterstock, to develop the model.
OpenAI didn’t respond to a request for comment.
If game content is indeed in Sora’s training set, it could have legal implications — particularly if OpenAI builds more interactive experiences on top of Sora.
“Companies that are training on unlicensed footage from video game playthroughs are running many risks,” Joshua Weigensberg, an IP attorney at Pryor Cashman, told TechCrunch. “Training a generative AI model generally involves copying the training data. If that data is video playthroughs of games, it’s overwhelmingly likely that copyrighted materials are being included in the training set.”
Probabilistic models
Generative AI models like Sora are probabilistic. Trained on a lot of data, they learn patterns in that data to make predictions — for example, that a person biting into a burger will leave a bite mark.
This is a useful property. It enables models to “learn” how the world works, to a degree, by observing it. But it can also be an Achilles’ heel. When prompted in a specific way, models — many of which are trained on public web data — produce near-copies of their training examples.
That has understandably displeased creators whose works have been swept up in training without their permission. An increasing number are seeking remedies through the court system.
Microsoft and OpenAI are currently being sued over allegedly allowing their AI tools to regurgitate licensed code. Three companies behind popular AI art apps, Midjourney, Runway, and Stability AI, are in the crosshairs of a case that accuses them of infringing on artists’ rights. And major music labels have filed suit against two startups developing AI-powered song generators, Udio and Suno, accusing them of infringement.
Many AI companies have long claimed fair use protections, asserting that their models create transformative — not plagiaristic — works. Suno makes the case, for example, that indiscriminate training is no different from a “kid writing their own rock songs after listening to the genre.”
But there are certain unique considerations with game content, says Evan Everist, an attorney at Dorsey & Whitney specializing in copyright law.
“Videos of playthroughs involve at least two layers of copyright protection: the contents of the game as owned by the game developer, and the unique video created by the player or videographer capturing the player’s experience,” Everist told TechCrunch in an email. “And for some games, there’s a potential third layer of rights in the form of user-generated content appearing in software.”
Everist gave the example of Epic’s Fortnite, which lets players create their own game maps and share them for others to use. A video of a playthrough of one of these maps would concern no fewer than three copyright holders, he said: (1) Epic, (2) the person using the map, and (3) the map’s creator.
“Should courts find copyright liability for training AI models, each of these copyright holders would be potential plaintiffs or licensing sources,” Everist said. “For any developers training AI on such videos, the risk exposure is exponential.”
Weigensberg noted that games themselves have many “protectable” elements, like proprietary textures, that a judge might consider in an IP suit. “Unless these works have been properly licensed,” he said, “training on them may infringe.”
TechCrunch reached out to a number of game studios and publishers for comment, including Epic, Microsoft (which owns Minecraft), Ubisoft, Nintendo, Roblox, and Cyberpunk developer CD Projekt Red. Few responded — and none would give an on-the-record statement.
“We won’t be able to get involved in an interview at the moment,” a spokesperson for CD Projekt Red said. EA told TechCrunch it “didn’t have any comment at this time.”
Risky outputs
It’s possible that AI companies could prevail in these legal disputes. The courts may decide that generative AI has a “highly convincing transformative purpose,” following the precedent set roughly a decade ago in the publishing industry’s suit against Google.
In that case, a court held that Google’s copying of millions of books for Google Books, a sort of digital archive, was permissible. Authors and publishers had tried to argue that reproducing their IP online amounted to infringement.
But a ruling in favor of AI companies wouldn’t necessarily shield users from accusations of wrongdoing. If a generative model regurgitated a copyrighted work, a person who then went and published that work — or incorporated it into another project — could still be held liable for IP infringement.
“Generative AI systems often spit out recognizable, protectable IP assets as output,” Weigensberg said. “Simpler systems that generate text or static images often have trouble preventing the generation of copyrighted material in their output, and so more complex systems may well have the same problem no matter what the programmers’ intentions may be.”
Some AI companies have indemnity clauses to cover these situations, should they arise. But the clauses often contain carve-outs. For example, OpenAI’s applies only to corporate customers — not individual users.
There are also risks besides copyright to consider, Weigensberg says, like violating trademark rights.
“The output could also include assets that are used in connection with marketing and branding — including recognizable characters from games — which creates a trademark risk,” he said. “Or the output could create risks for name, image, and likeness rights.”
The growing interest in world models could further complicate all this. One application of world models — which OpenAI considers Sora to be — is essentially generating video games in real time. If these “synthetic” games resemble the content the model was trained on, that could be legally problematic.
“Training an AI platform on the voices, movements, characters, songs, dialogue, and artwork in a video game constitutes copyright infringement, just as it would if these elements were used in other contexts,” Avery Williams, an IP trial lawyer at McKool Smith, said. “The questions around fair use that have arisen in so many lawsuits against generative AI companies will affect the video game industry as much as any other creative market.”