Proprietary Data Considered Harmful to Open Source AI

“OSI’s Definition of Open Source AI Raises Critical Legal Concerns for Developers and Businesses”, says Luca Antiga, AI developer of PyTorch fame and fellow lobbyist for Big Freedom. He literally wrote the book on the topic, so while every voice is equal in this ongoing asinine debate, some are more equal than others. When people like Luca talk, I listen, which is more than can be said for the anonymous Open Source Initiative (OSI) representative who publicly took him to task over it on LinkedIn, making sure to @-mention him, his publication, and — I kid you not — his employer. I hope an apology will be forthcoming, because this goes way beyond the bounds of professional discourse and is frankly unbecoming of an organisation that claims to champion open dialogue and collaboration.

On the subject of the recent release candidate for the (too) soon-to-be-finalised Open Source AI Definition (OSAID), Luca “believes it leaves a critical question unanswered, particularly for developers and businesses looking to adopt open source AI models with confidence.” The “elephant in the room”, he suggests, is that models “may still be considered open source if all information about data sources” is shared, without requiring access to the data itself, even though data has needed to be open for a derivative of it to be open for as long as Open Source has existed, and he gives some examples to that effect.

Indeed, ideally training datasets would be made available under an Open Data license, and while there is a large and growing collection of such pristine specimens, the real world is messy. Extremely popular sources like Common Crawl — basically a dump of the entire Internet — don’t own the copyrights in the underlying content, so they couldn’t license it to you even if they wanted to, instead making it available for free under their terms of use. This data is apparently so toxic it can’t possibly be used to train Open Source AI, and yet so clean AWS agreed to host it for free as a Public Data Set (cheers!). The good news is that while this post will eventually end up in there, I’m not about to try to sue you for using it to train your AI, and even if I did, I’d be unlikely to prevail. An artist who gazes upon the Mona Lisa is bound to incorporate its teachings, consciously or subconsciously, into their future work, and yet whatever remnants of this post survive in a corpus of all public works will be undetectably small.
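Lest “basically a dump of the entire Internet” sound like hyperbole, here is a minimal sketch of my own (not from Luca’s article) showing how anyone can look a page up in Common Crawl’s public CDX index; the crawl label is just an example, and any published crawl will do:

```python
# Minimal sketch: find captures of a URL in Common Crawl's public CDX index.
# CC-MAIN-2024-33 is an example crawl label; substitute any published crawl.
import json
import urllib.request

CRAWL = "CC-MAIN-2024-33"
query = (
    f"https://index.commoncrawl.org/{CRAWL}-index"
    "?url=example.com&output=json"
)

with urllib.request.urlopen(query) as response:
    for line in response:
        capture = json.loads(line)  # one JSON record per capture
        # filename/offset/length locate the raw WARC record in the archive
        print(capture["url"], capture["filename"],
              capture["offset"], capture["length"])
```

The filename, offset, and length returned are all you need to pull the raw capture straight out of Common Crawl’s public bucket; no license negotiation in sight, just those terms of use.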

This right here is the compromise hiding in plain sight that would make most of the objections go away overnight (including my own): require training datasets to be shared, but accept Open Access data, even if only on a transitional basis, all the while encouraging the creation of truly open foundations for future generations. The spawn of these could end up in holy sanctuaries like Debian Linux’s main repository, while those with the taint of toxic waste go in contrib or non-free. That two-tier approach, which has also been proposed several times based on the Library (now Lesser) GPL precedent, will likely eventuate de facto anyway. Debian, home of the Debian Free Software Guidelines (DFSG) on which the OSI’s own Open Source Definition was based, spent the last twenty years trying to keep binary blobs of proprietary firmware out of the Linux kernel, only to pragmatically allow them into the installer in 2022, voting to modify the Debian Social Contract in order to do so. If they can stomach it to avoid irrelevance, so can we.
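For those who haven’t lived inside a Debian box, the tiers are not a metaphor: they are literal archive sections that users opt into, one word at a time, in their apt configuration (non-free-firmware being the section created in the wake of that 2022 vote). A typical /etc/apt/sources.list line looks like this:

```
deb http://deb.debian.org/debian bookworm main contrib non-free non-free-firmware
```

Everything after the suite name is a tier, and dropping a word from the line keeps that tier’s packages off your system entirely.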

Back to Luca, whom the OSI publicly scolded for giving “an errant interpretation of a footnote in RC1”. As usual, while the big print giveth (“Parameters shall be made available under OSI-approved terms”), the small print taketh away:

“The Open Source AI Definition does not take any stance as to whether model parameters require a license, or any other legal instruments, and whether they can be legally controlled by any such instruments once disclosed and shared.”

Apparently this linguistic car crash — which starts by saying the exact opposite of the claim it’s attempting to clarify — is trying to say that the OSAID “does not require nor advocates a specific legal mechanism for assuring that the model parameters are freely available to all”. So the beating hearts of all AI systems “may be free by their nature”, or could require a license or some other “instrument” that “might become clearer over time”, conveniently after their self-imposed deadline of the All Things Open conference on 28 October. Does the same community that famously turned copyright against itself to invent copyleft really not know whether the document it’s rushing out the door half-baked is worth the paper it’s not printed on for protecting the Model, or whether one is even required at all, given they’re not demanding the Data and the Code is already covered by OSI-approved licenses?

The OSI then cite another word salad from the FAQ — an informative reference that has absolutely no bearing on the normative definition — which asks (but doesn’t answer) “Are model parameters copyrightable?”, and links to a paper that wonders “Is Numerical Data Copyrightable?”. At last the origin of the red herring regularly used to kick up dust about questions that are unanswered and unanswerable — conveniently until after the cat is out of the bag — has been revealed, so let’s take a closer look at that too:

Essentially, this process entails transformation of copyright protected works, for example, a short story, into numerical values. Does copyright extend to this transformed subject matter?

You know what else does that? Digital cameras. And you know what’s still protected by copyright? Digital photos. If you don’t believe me then by all means go train your fancy new AI model on commercial stock photos licensed “including for fee” (which is allowed under the release candidate) from Adobe, Getty, etc. — whose only job is to monetise those datasets — and see how long it is before you get sued out of existence. Go to Jail. Do Not Pass Go, Do Not Collect $200. Jail.
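If the “numerical values” argument still seems persuasive, here is a trivial sketch of my own (not from the paper) performing exactly the transformation it asks about, on a “short story” of sorts:

```python
# A "short story" transformed into numerical values, the very operation the
# paper asks about. The numbers are a lossless re-encoding of the text, just
# as a digital photo is a grid of pixel values encoding a scene.
story = "It was a dark and stormy night."
as_numbers = list(story.encode("utf-8"))
print(as_numbers[:8])  # [73, 116, 32, 119, 97, 115, 32, 97]

# The work survives the round trip intact, copyright and all:
assert bytes(as_numbers).decode("utf-8") == story
```

The representation changes; the work, and the rights attached to it, do not.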

You know how I know that’s what’s going to happen? Because The Times Sue[d] OpenAI and Microsoft Over A.I. Use of Copyrighted Work, arguing that “millions of articles from The New York Times were used to train chatbots that now compete with it” and demanding “destruction […] of all GPT or other LLM models and training sets that incorporate Times Works.” Data custodians cannot possibly allow you to melt down and hand out their crown jewels, and any (subscription-only) license they eventually grant is likely to be so expensive as to make an absolute mockery of Open Source’s promises of No Discrimination Against Persons or Groups [or] Fields of Endeavor. The way I see it, the current release candidate is like an open manhole in a busy street, and any individual or business adopting a model certified by it risks a one-way trip into the sewer.

Luca has since replied saying his “main intention was in fact to spark discussion [like this] in the open [because] there is a diverse set of stakeholders who might not be paying attention, and this is an opportunity to engage in the process,” concluding that “there is a chasm in what the general public perceives as being ‘open source’ and what the definition is proposing, and it needs clarification.” Rather than calling for “more light than heat in these situations”, the OSI’s leadership would do well to take heed of the rapidly growing group of experts bearing torches, as well as his observation that “more precise definitions will likely emerge elsewhere to fill the gap [as] Open Source AI can only move forward toward widespread enterprise adoption with this understanding.”

Earlier tonight we should have discussed this and the many other unresolved objections from the community at the 20th(!) town hall on the topic, but it didn’t happen for whatever reason, so we’re back to spilling digital ink. I should really be spending my Kwaai time coding the Personal Artificial Intelligence Operating System (pAI-OS) — an Open Source project I hope will be a beneficiary rather than a victim of this process — but this is that important and inexplicably urgent.

I would personally approve of an OSAID definition that only allows training data to be either OSD-compliant (at best) or “publicly available” (at worst) — I proposed this in the past as well, back in the early days of the OSAID process.

But note that even that would not be enough to make OSAID-compliant training data hostable in Debian’s non-free section (let alone contrib, which is not even an option here). Crucially, such an OSAID would not guarantee the right to redistribute publicly available data to third parties like Debian. (On that front Common Crawl itself is betting on the “fair use” card, which has not been tested in court, AFAIK, and is not something that generalizes well internationally outside of the US anyway.)

Using the content included in Common Crawl for AI training purposes has been permitted under Japanese copyright law since 2009. In other words, no license is required for the data contained in Common Crawl. The governing law for copyright is determined by the place of use, so using Common Crawl for AI training is entirely free as long as the activity occurs in Japan. Japan is a civil law country where statute controls, and as this provision has operated without issue for 15 years, it can be said to have been sufficiently tested.

Conversation and debate around the OSAID work is a good thing.

Remember, though, that the discussion forums are one part of one conversation in a much wider and longer consultation and design process, and one that is bringing different voices and perspectives into the open source community. Consultation and co-design processes that are sufficiently wide and broad should expect to encounter disagreement. It means we are considering all the angles, and that is important.

It’s clear that there is a strength of personal feeling being expressed in this post. Let’s make sure we all take a breath every now and again and continue to keep things constructive and collegial for the benefit of the bigger piece of work.

The problem is that the “different voices and perspectives” that OSI is bringing “into the open source community” behind the scenes are not addressing the issues reported against the latest drafts and RC1 of the “Open Source AI” definition.

And given that OSI’s “inclusive approach” to the “co-design process” included Meta and enabled Meta employees to cancel out other voices to obtain the exclusion of training data, to any external observer the whole process looks severely biased towards interests that have nothing to do with open source.

And maybe I’m connecting the dots in the wrong way, but guess what?

The interests that OSI is trying to defend so desperately are totally aligned with those of their most generous sponsors right now: Google, Microsoft and GitHub (aka OpenAI’s owner), Intel (@Mer’s employer) and Meta itself (who blatantly exploited the “co-design process”).

And yet such Big Techs’ arguments in support of the currently flawed definition stay behind the scenes.

Why?

Don’t get me wrong: if OSI is so fond of Big Tech’s arguments, I’m sure they are pretty solid and valuable.

That’s why we are asking in public how they address the issues we reported.

Because to an outside observer, secret arguments, however valid, are indistinguishable from missing arguments.

Of note is the fact that some of the authors of the MOF paper, which has been used extensively in this discussion, do not seem to agree with the current draft.

The authors of the MOF likely believe that the OSAID is leaning too far toward the domain of open science. I don’t think such opinions are unusual at all.