We finally have an ‘official’ definition for open source AI



There’s finally an “official” definition of open source AI.

The Open Source Initiative (OSI), a long-running institution aiming to define and “steward” all things open source, today released version 1.0 of its Open Source AI Definition (OSAID). The product of several years of collaboration with academia and industry, the OSAID is intended to offer a standard by which anyone can determine whether AI is open source — or not.

You might be wondering — as this reporter was — why consensus matters for a definition of open source AI. Well, a big motivation is getting policymakers and AI developers on the same page, said OSI EVP Stefano Maffulli.

“Regulators are already watching the space,” Maffulli told TechCrunch, noting that bodies like the European Commission have sought to give special recognition to open source. “We did explicit outreach to a diverse set of stakeholders and communities — not only the usual suspects in tech. We even tried to reach out to the organizations that most often talk to regulators in order to get their early feedback.”

Open AI

To be considered open source under the OSAID, an AI model has to provide enough information about its design so that a person could “substantially” recreate it. The model must also disclose any pertinent details about its training data, including the provenance, how the data was processed, and how it can be obtained or licensed.

“An open source AI is an AI model that allows you to fully understand how it’s been built,” Maffulli said. “That means that you have access to all the components, such as the complete code used for training and data filtering.”

The OSAID also lays out usage rights developers should expect with open source AI, like the freedom to use the model for any purpose and modify it without having to ask anyone’s permission. “Most importantly, you should be able to build on top,” added Maffulli.

The OSI has no enforcement mechanisms to speak of. It can't compel developers to abide by the OSAID. But it does intend to flag models that are described as "open source" yet fall short of the definition.

“Our hope is that when someone tries to abuse the term, the AI community will say, ‘We don’t recognize this as open source,’ and it gets corrected,” Maffulli said. Historically, this has had mixed results, but it isn’t entirely without effect.

Many startups and big tech companies, most prominently Meta, have employed the term “open source” to describe their AI model release strategies — but few meet the OSAID’s criteria. For example, Meta mandates that platforms with over 700 million monthly active users request a special license to use its Llama models.

Maffulli has been openly critical of Meta’s decision to call its models “open source.” After discussions with the OSI, Google and Microsoft agreed to drop their use of the term for models that aren’t fully open, but Meta hasn’t, he said.

Stability AI, which has long advertised its models as “open,” requires that businesses making more than $1 million in revenue obtain an enterprise license. And French AI upstart Mistral’s license bars the use of certain models and outputs for commercial ventures.

A study last August by researchers at the Signal Foundation, the nonprofit AI Now Institute, and Carnegie Mellon found that many "open source" models are open source in name only. The data required to train them is kept secret, the compute power needed to run them is beyond the reach of many developers, and the techniques to fine-tune them are intimidatingly complex.

Instead of democratizing AI, these "open source" projects tend to entrench and expand centralized power, the study's authors concluded. Indeed, Meta's Llama models have racked up hundreds of millions of downloads, and Stability claims that its models power up to 80% of all AI-generated imagery.

Dissenting opinions

Meta disagrees with this assessment, unsurprisingly — and takes issue with the OSAID as written (despite having participated in the drafting process). A spokesperson defended the company’s license for Llama, arguing that the terms — and accompanying acceptable use policy — act as guardrails against harmful deployments.

Meta also said it’s taking a “cautious approach” to sharing model details, including details about training data, as regulations like California’s training transparency law evolve.

“We agree with our partner the OSI on many things, but we, like others across the industry, disagree with their new definition,” the spokesperson said. “There is no single open source AI definition, and defining it is a challenge because previous open source definitions do not encompass the complexities of today’s rapidly advancing AI models. We make Llama free and openly available, and our license and Acceptable Use Policy help keep people safe by having some restrictions in place. We will continue working with the OSI and other industry groups to make AI more accessible and free responsibly, regardless of technical definitions.”

The spokesperson pointed to other efforts to codify “open source” AI, like the Linux Foundation’s suggested definitions, the Free Software Foundation’s criteria for “free machine learning applications,” and proposals from other AI researchers.

Meta, incongruously enough, is one of the companies funding the OSI’s work — along with tech giants like Amazon, Google, Microsoft, Cisco, Intel, and Salesforce. (The OSI recently secured a grant from the nonprofit Sloan Foundation to lessen its reliance on tech industry backers.)

Meta’s reluctance to reveal training data likely has to do with the way its — and most — AI models are developed.

AI companies scrape vast amounts of images, audio, videos, and more from social media and websites, and train their models on this “publicly available data,” as it is usually called. In today’s cut-throat market, a company’s methods of assembling and refining datasets are considered a competitive advantage, and companies cite this as one of the main reasons for their nondisclosure.

But training data details can also paint a legal target on developers’ backs. Authors and publishers claim that Meta used copyrighted books for training. Artists have filed suits against Stability for scraping their work and reproducing it without credit, an act they compare to theft.

It’s not tough to see how the OSAID could be problematic for companies trying to resolve lawsuits favorably, especially if plaintiffs and judges find the definition compelling enough to use in court.

Open questions

Some suggest the definition doesn't go far enough, for instance in how it handles the licensing of proprietary training data. Luca Antiga, the CEO of Lightning AI, points out that a model can meet all of the OSAID's requirements even when the data used to train it isn't freely available. Is it "open" if you have to pay thousands of dollars to inspect the private stores of images that a model's creators paid to license?

“To be of practical value, especially for businesses, any definition of open source AI needs to give reasonable confidence that what is being licensed can be licensed for the way that an organization is using it,” Antiga told TechCrunch. “By neglecting to deal with licensing of training data, the OSI is leaving a gaping hole that will make terms less effective in determining whether OSI-licensed AI models can be adopted in real-world situations.”

In version 1.0 of the OSAID, the OSI also doesn't address copyright as it pertains to AI models, or whether granting a copyright license would be enough to ensure a model satisfies the open source definition. It's not yet clear whether models, or components of models, can be copyrighted under current IP law. But if the courts decide they can be, the OSI suggests new "legal instruments" may be needed to properly open source IP-protected models.

Maffulli agreed that the definition will need updates, perhaps sooner rather than later. To that end, the OSI has established a committee responsible for monitoring how the OSAID is applied and proposing amendments for future versions.

“This isn’t the work of lone geniuses in a basement,” he said. “It’s work that’s being done in the open with wide stakeholders and different interest groups.”



