Squaring the Circle of Open Source AI without Data

The Open Source Initiative (OSI) continued their condescension towards the community today in a self-congratulatory post explaining “How we passed the AI conundrums”, like they were Alexander slashing through the Gordian knot (the article’s featured image is a modern equivalent, a Megaminx).

Turtles All The Way Down

The anonymous representative starts out by painting “some people” as extremists for demanding the same licensing as has been required by the Open Source Definition for decades. They then set up the strawman that “limiting Open source AI only to systems trainable on freely distributable data would relegate Open Source AI to a niche,” and immediately slash it down by claiming such “freely and legally shareable data is a tiny fraction of what is necessary”. This ignores the inconvenient truth that nobody is still seriously asking for Open Data licensing (as much as I/we would prefer it); rather, we have accepted the significant compromise of publicly accessible data, which still guarantees the ability to exercise the four essential freedoms enumerated in the draft itself, at least with respect to their own creation. It’s not entirely unlike Debian Linux finally agreeing to ship non-free firmware after two decades so you can actually run their software on your hardware, only we’re not even talking about shipping or sharing anything that’s not already publicly shared by others.

The argument also completely ignores the fact that more and more valuable Open Data is coming online every day, and that when Open Source was established decades ago there was only a “tiny fraction” then of the software that’s available now. Indeed, in the absence of a clear and strong Open Source Definition it’s unlikely that the industry would have curated such a large and valuable archive. Just yesterday the creators of AlphaFold won the Nobel Prize in Chemistry for the work behind its Creative Commons (CC-BY-SA) licensed database, which “provides open access to over 200 million protein structure predictions to accelerate scientific research”. The reality is that a more ambitious Open Source AI definition, in line with the existing understanding of what Open Source is, will incentivise rather than deter the creation of new reusable datasets, for example medical data provided with consent by patients (to murder another regular strawman).

Karsten Wade’s proposal to achieve consensus for the 1.0 release yesterday codifies this and resolves nearly all of the outstanding objections (including mine) simply by keeping Open (i.e., shareable) and Public (i.e., accessible) data while eliminating the two most problematic data classes: “Restricted (née Obtainable) and Unavailable (née Unshareable)”. Clearly you cannot exercise any/all of the four freedoms with data you can’t even access, let alone afford, but as I noted yesterday in Proprietary Data Considered Harmful to Open Source AI, restricted datasets “including for fee” (e.g., stock photos, New York Times articles) are even more dangerous in that they are almost guaranteed to get those relying on the OSI’s definition nuked from orbit by data custodians — whose one job it is to exploit their government-granted monopoly over their datasets. At least public datasets like Common Crawl have already been subjected to sterilising sunlight, and anything that needed to be removed will have been removed via the claims process in their terms of use — many eyes make bugs shallow after all. This Achilles heel alone is enough to render RC1 dead on arrival when the first actual enterprise lawyer looks at it, and if nothing else it must be abandoned for the sake of public safety (you’re welcome to cite this point later and I’ll be sure to post it where it will be seen).

The author then applies the false analogy of comparing training data to proprietary system libraries, forgetting that you can still run and verify the behaviour of a program like Emacs running on a proprietary operating system like macOS, but the same cannot be said for machine learning models trained on inaccessible data — the training data input is fundamentally tied to the system’s output while libraries are not.

Going for the trifecta, they also resort to equivocation, shifting the definition of “Open Source” to accommodate the missing data by claiming they’re still sharing metadata. This change would very significantly alter the well-established meaning of “Open Source”, doing damage to that entire ecosystem as well — reason enough in and of itself for the board to reject the release candidate, even if that meant releasing nothing at all. Incidentally, the sunk cost fallacy also made an appearance on the twentieth town hall call, where one of the three reasons given for releasing anything at all was that it has already been worked on for a while.

I have yet to extract an admission from anyone that they’re knowingly capitulating on the four freedoms and redefining “Open Source”, including on that call, but I’d love for someone to get the claim that the four freedoms are protected on record, and then have them justify it. The entire argument really is a stack of logical fallacies in a trench coat.

Conundrums?

“The AI Conundrums” refers to the insightful post back in July by Stephen O’Grady of developer-focused industry analyst Redmonk, an ice age ago in the accelerating technological change of the AI industry. In the megamix article he covers the business models of commercial providers (ChatGPT, Claude, Gemini, etc.), the commoditisation capabilities of gateways like LiteLLM (product) and OpenRouter (service), and wraps up with a look at Open Source, AI, and Data. Regarding the draft definition’s weasel-worded request for metadata (i.e., data about data) rather than the actual data on which the AI system was trained, O’Grady points out that “many smart and reasonable individuals with decades of open source experience regard this as insufficient” (you can see for yourself via this sampling of links).

“Julia Ferraioli, meanwhile, explicitly made the case […] that the current definition falls short, arguing in essence that without the data – and a sure way to replicate it – the draft AI definition cannot fulfill the four key freedoms that the traditional OSI definition does.” O’Grady concurs:

This is correct. If data is key to AI, and data may or may not be replicated per these more lenient terms, then while users would be able to use or distribute the system, clearly the draft definition could not guarantee the right to study the full system or modify it deeply.

Note that the last word is doing a lot of heavy lifting here because of the common sleight of hand that providing enough [meta]data to enable any modification (e.g., fine-tuning) is sufficient to satisfy the freedom to modify, equating it to enabling all modifications or improvements: “If your right to modify a program is limited, in substance, to changes that someone else considers an improvement, that program is not free.”

O’Grady notes that “an OSI definition that does not require the inclusion of training data is problematic”, but that requiring “full training set availability” is also problematic, giving two reasons that I would argue are non-issues for public rather than open data:

  • Practicality: These datasets are large and unwieldy. Fortunately, they also tend to already be reliably hosted by third parties like AWS. As I noted yesterday, for popular training datasets like Common Crawl, “this data is apparently so toxic it can’t possibly be used to train Open Source AI, and yet so clean AWS agreed to host it for free as a Public Data Set”.
  • Legality: “Authors may rely on training data that cannot be legally released under an open source license,” and while my preferred Open Source AI definition would require Open Data licenses, the far better compromise than throwing the baby out with the bathwater is to accept that this is often impractical and loosen the requirement to simply demand that the data be public (which, again, is typically hosted by third parties).

O’Grady then shifts to outcomes, re-confirming the position that “strict adherence to the four freedoms clearly necessitates a full release of training data”. The OSI literally has one job, and that’s to protect those four freedoms, deemed essential for a reason, and ironically listed in the draft definition which then goes on to not protect them:

  • Use the system for any purpose and without having to ask for permission.
  • Study how the system works and inspect its components.
  • Modify the system for any purpose, including to change its output.
  • Share the system for others to use with or without modifications, for any purpose.

He then gets to the OSI’s purported “smoking gun” now appearing in every slide deck on every call and at every event trying to justify this sorry state of affairs:

If we assume, for example, that the definition requires full release of datasets, one thing is certain: in Julia’s words, it would be “a definition for which few existing systems qualify.” (OSI note: also less powerful and limited to specific domains)

Set aside that injecting your own opinion is not how quotes work, and that Julia Ferraioli — who you may remember as one of those “smart and reasonable individuals with decades of open source experience” advocating for the availability of data above — is being taken out of context to make the opposite point to her own clearly-stated opinion: the statement itself is conditional on requiring “full release of datasets”, which absolutely nobody is asking for.

Maybe Redmonk will do a follow-up, but I did ask them what they thought of the OSI [ab]using their article to argue for a less open Open Source AI definition and they had this to say. I’m going to let Julia wrap it up as she explains why we consider this issue so important — even existential for Open Source in an AI-dominated world — better than I could:

Open source software was a radical concept in the beginning. We didn’t get to where we are today by abiding by the status quo. We need to carry that forward with us into new domains, into new (or renewed, in the case of AI) technologies. We need to be bold and brave. We need to fight for openness and transparency.


“Some people believe that full unfettered access to all training data is paramount.”
Sorry if there is any culture gap, language gap, or misunderstanding, but the very first sentence reads like a disagreement with those “some people”.

That article inverts the logic. For an AI model, the training code is the “compiler”, while the data is the “code”. Open any statistical learning, machine learning, or deep learning book and you will find that the data (almost) fully decides what the model does to produce its outputs. A very typical example is K-NN (K-nearest neighbours, a model a high-school student can understand with appropriate teaching material), where the model makes every decision directly from its training data. All the training code does is convert the statistical knowledge in the data into a compact form defined by that code.

In that sense, deleting a portion of the training data and retraining the model is like “deleting a feature from emacs”. Deleting data that cannot legally be redistributed is like “deleting an optional proprietary feature from emacs”. Having no data at all while having the training software is like trying to compile emacs from nothing.
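To make that concrete, here is a minimal sketch of a 1-nearest-neighbour classifier in plain Python (a toy illustration for this thread, not code from any OSAID draft or any real system): the “trained model” is nothing more than the stored training data, and withholding part of that data changes the system’s behaviour without a single line of code changing.

```python
# Toy 1-nearest-neighbour classifier: the "model" produced by training
# is literally just the training data, and every prediction is decided
# directly by that data.

def train(examples):
    # "Training" is trivial here: the model *is* the data.
    return list(examples)

def predict(model, point):
    # Return the label of the training example closest to `point`.
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest_features, nearest_label = min(model, key=lambda ex: sq_dist(ex[0], point))
    return nearest_label

data = [((0.0, 0.0), "blue"), ((1.0, 1.0), "red"), ((0.9, 1.1), "red")]

model = train(data)
print(predict(model, (0.8, 0.9)))   # "red" -- determined entirely by the data

# Withhold part of the data and retrain: the behaviour changes even though
# not a single line of code has changed.
model_partial = train(data[:1])
print(predict(model_partial, (0.8, 0.9)))  # now "blue"
```

Larger neural networks compress the data into weights rather than storing it verbatim, but the dependency is the same: the training code is generic, and a different dataset fed through it yields a different system.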

How an AI system runs is indeed defined by the training and inference code. What it produces is defined by the data.

The correct way to frame the metaphor:

Object of interest      Emacs binary executable    neural network
Conceptual compiler     GCC                        training code
Conceptual source       emacs.c, emacs.h           training data

I hope the drafting process of OSAID is really backed by at least some people who really know what AI is and how to build it. The OSD could not have been defined by people who could not build software. The OSAID cannot be defined by people who cannot build AI.


Regarding the table, an alternative (and maybe easier to understand) form may look like:

Output=(emacs binary executable) ← engine(GCC) input(emacs.c, emacs.h)
Output=(neural network) ← engine(training software) input(training data)


While I’m with you that the draft makes no sense as it stands by not treating training data as a “source” for AI models, it feels like your post overcomplicates things quite a bit. I personally had a hard time following what’s written there (ESL here). Does it really have to be that complicated? People may simply decide not to bother reading and participating.

I also haven’t seen you mention techniques for fusing smaller models into one large one, which might go some way towards addressing the problem of reproducing a model from the source data with limited computational resources.

