I have just shared my thoughts (minus some deep links that are available in the original but omitted here due to forum limits) on why I believe the OSI board must not approve the proposed OSAID next month, given the outstanding objections:
The Open Source Initiative (OSI) are seeking endorsements of their upcoming Open Source AI Definition (OSAID), which has worked its way through a series of drafts since the start of the year to land on v0.0.9. Any differences between this and RC 1 are likely to be minimal as the form notes they "will double-check with [endorsers] to confirm [their] continued endorsement of any changes from v.0.0.9 to RC1". Do not endorse this draft.
While the horse is being loaded into the starting gate, it hasn't bolted yet, so it's not too late to advocate for a change in direction. Tom Callaway agrees: "we can always make the definition more permissive if we discover that we have been too ambitious with a definition, but that it is functionally impossible for us to be more ambitious later".
Background
As a refresher, the OSI's own Open Source Definition, which was originally derived from the Debian Free Software Guidelines (DFSG), requires (among other things) that "the source code must be the preferred form in which a programmer would modify the program". Note that the Free Software Foundation's own Free Software Definition predates the DFSG, but the DFSG was not derived from it. In a world where software was delivered in a compiled binary form that is infeasible to modify, Open Source gave users the blueprints to make changes as they saw fit, by way of OSI Approved licenses. This is good stuff, and the OSI has played an important role in the industry to date.
Enter cloud computing, almost 20 years ago now(!), and the transition from products to services that disrupted the IT industry (AI is set to disrupt every industry). For the Free and Open Source Software (FOSS) community this was a problem too, because the "service provider loophole" meant that software was no longer being "conveyed" (a requirement to use it when it had to run on your machine) but merely executed on a server, with controlled access provided. Attempts to address this by making viral licenses like the GNU General Public License (GPL) even more viral (e.g., the Affero GPL, aka AGPL), triggering the requirement to redistribute source simply by accessing it over a network rather than by distributing it, largely failed (thankfully).
That's why I rounded up several of the early pioneers of cloud computing to form the Open Cloud Initiative (OCI) and, through a consultative process, determined that "Open Cloud" would require open standard formats and interfaces: there's no point having transparent access (i.e., APIs) to opaque data (i.e., undocumented, obfuscated, or even encrypted formats), nor having no programmatic access to transparent data (i.e., open standard formats). Related definitions were included for "Free Cloud" and "Open Source Cloud", for which we are seeing demand again today with AI. The OSI declined my offer to take on this challenge at the time, but I understand they may yet; hopefully they get it right if and when they do. This context is useful for understanding how we got to where we are.
Open Source AI
Artificial Intelligence (AI), on the other hand, is something the OSI determined needed to be addressed this year, setting out to do so with the first town hall meeting on 12 January 2024, based on early drafts from a private workgroup. From the slides, they rightly started by asking "What is the preferred form to make modifications to an AI system?", noting that:
To be Open Source, an AI system needs to be available under legal terms that grant the freedoms to:
- Use the system for any purpose and without having to ask for permission.
- Study how the system works and inspect its components.
- Modify the system to change its recommendations, predictions or decisions to adapt to your needs.
- Share the system with or without modifications, for any purpose.
So far, so good. They also rightly define the components of an AI system as:
- Code: Instruction for a computer to complete a task.
- Model: Abstracted representation of what an AI system has learned from the training data.
- Data: Information converted into a form efficient for processing and transfer.
So, in order to protect the four freedoms, we just need to make the three components available under approved licenses, right? Apparently, not.
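To make those three components concrete, here is a minimal sketch (entirely illustrative: the toy task, training loop, and file name are mine, not from any OSI material) in which the code trains, the data teaches, and the model is what gets learned.

```python
import numpy as np

# Data: information the system learns from (a toy stand-in for a corpus).
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])  # underlying relationship: y = 2x + 1

# Code: instructions for the computer to complete the task (training).
def train(X, y, lr=0.01, epochs=5000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        pred = w * X + b
        w -= lr * 2 * np.mean((pred - y) * X)  # gradient of mean squared error
        b -= lr * 2 * np.mean(pred - y)
    return w, b

# Model: the abstracted representation of what was learned (the weights).
w, b = train(X, y)
np.savez("model.npz", w=w, b=b)  # an "open weights" release ships only this file
```

Publish all three and anyone can reproduce, audit, and rework the system; publish only model.npz and they can merely run it. That distinction is exactly what the subsequent drafts gave away.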
Two-Legged Stool
By draft 0.0.6, the third leg of the stool, data, had been cut off and deemed "not required" but merely "appreciated" (a pointless observation for a litmus test). In draft 0.0.7, the data deficiency was acknowledged by rebranding it as "data transparency", which was when I waded into the discussion:
I worry that while the definition (which should probably be labeled as such rather than "What is Open Source AI") requires that users be able to "modify the system for any purpose" (which is implied in the DFSG and implemented in its terms), the checklist makes the requisite inputs for said modifications (i.e., training data) optional but "appreciated".
OSI's Executive Director, Stefano Maffulli, replied, referring me to an earlier thread from January on the elephant in the room and claiming that "the issue is not that 'most models will not meet the definition' but that none will do." Richard Fontana concurred, adding that "as a practical matter there will be few, if any, open source models other than perhaps some toy examples or research models of limited practical interest and utility". This point is irrelevant, though: either the definition meets the standard set by the community long ago (whether or not any existing implementations are compliant out of the gate), or it does not, and I would join others in arguing that the proposed definition does not.
Existing Open Source AI
Stefano Zacchiroli agreed in the earlier thread, stating that "the ability to redistribute the original training dataset should be a requirement for the freedom to modify an ML-based AI system, because short of that the amount of modifications one can do to such a system is much more limited." He beat me to responding in the later thread that appropriately licensed training data sets like The Stack v2 (and Wikipedia, Wikidata, Wikimedia Commons, etc.) do exist and are already being used to create what I/we would consider to be true "Open Source AI", like StarCoder2. The "truly open" (their words, not mine) Open Language Model (OLMo) was also cited, demonstrating that truly Open Source AI is an achievable aim, provided we don't poison the well by lowering the well-established standards (irrespective of the noted existing ab/use of the term "Open Source" in the machine learning community, which is also irrelevant). This "prior art" meets the board approval criterion that the definition "Provides real-life examples: The definition must include relevant examples of AI systems that comply with it at the time of approval, so cannot have an empty set."
Neither thread came to even a rough consensus, with folks on both sides of the debate maintaining their strongly-held positions. Various issues were raised, including:
- not being able to fix models trained on defamatory data;
- models being derivative works of their training data and thus inheriting its license, or alternatively being mathematical calculations unprotectable by copyright;
- the practicality of hosting large data sets, and the risk of third-party hosted data sets disappearing;
- the EU's AI law exclusion for "open weights"-only models (which doesn't matter);
- whether we are weighing reproducibility too heavily or not enough (if not for reproducibility of binaries, what is the point of source code?), and whether byte-for-byte reproducibility is an achievable or even useful goal;
- the issue of open access but copyrighted data, including Flickr images and ImageNet;
- and various other interesting but often irrelevant arguments (before devolving into a debate about OSI governance and the history of the OSD).
Weasel Words
By draft 0.0.8 the weasel word "information" had been introduced, defining the contrived term "data information" (otherwise known as metadata, as in "data about data") to be "sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data". The excuse given on behalf of these non-free models is that the training data is inconvenient or impossible to release, typically because it's subject to proprietary copyrights (e.g., New York Times articles, YouTube transcripts) or because it contains personally identifiable information (e.g., healthcare records). Encouraging the creation and curation of free data sets, rather than undermining them, is precisely why we owe it to the community to meet our own well-established standards.
In the current draft 0.0.9 the additional requirement that "Data information shall be made available with licenses that comply with the Open Source Definition" was introduced, but it is too little, too late; metadata licensing is meaningless if the data itself is unavailable. A redundant and confusing attempt to disambiguate AI models (including weights) from AI weights was also added, despite both code and weights being covered earlier. It is clear to me and many others that this version, which is set to become the release candidate and ultimately the 1.0 release at All Things Open on 28 October 2024, remains deficient and should not be adopted. It's as if the process (not helped by the curious board approval criterion that it be "Ready by October 2024: A usable version of the definition needs to be ready for approval by the board at the October board meeting", rather than ready when it's ready) went something like:
- We have to do something.
- This is something.
- We have to do this.
Why not both?
I would be remiss at this point not to acknowledge the carefully-considered compromise offered by Karsten Wade, open source old guard and fellow maintainer of the Personal Artificial Intelligence Operating System (pAI-OS), which is the source of my interest in the topic. He presented a detailed proposal to satisfy both camps by bifurcating Open Source AI models into the default, which is required to release the data ("D+"), and those with an "integrity exception" ("D-"). As discussed in the earlier thread, there is precedent for this dating back to the 1991 GPLv2 system library exemption for proprietary operating system components, but that exemption was made when it was not possible to release an Open Source operating system without them, which is no longer the case, and was never the case here. A related exception was the creation of the Lesser GPL (LGPL), with which an author can allow their code to be linked with non-free software.
While this proposal has some traction with another opponent of the proposed definition, Steven Pousty, he pointed out that the "OSI has made no such distinction with software so why should they with AI/ML." I also prefer to keep solutions as simple as possible (but no simpler), and in any case the OSAID needs to function as a binary litmus test like the Open Source Definition (OSD): either you're in (i.e., "D+") or you're out. Fortunately, there's a perfectly good existing term that describes "D-": Open Weights.
Toxic Candy
We owe nothing to those who don't commit to the cause and need not pander to the wants of their business models, but one could still make the case for a pejoratively-named "Lesser" license for those who don't or can't release training data. Mo Zhou, a fellow Debian developer who drafted Debian's ML Policy and authored the earlier thread, has a special name for these appealing yet problematic models: Toxic Candy.
Toxic Candy Model is a model satisfying ALL the following conditions:
- The model itself is FOSS-Licensed & DFSG-compliant;
- trained from unknown, private, or non-free datasets or simulators;
- corresponding training program or script is present and complete;
It's worth referencing the definition of a free (as in freedom, not as in beer) model, as Debian appear to have already got it right half a decade ago:
Free Model is a model satisfying ALL the following conditions:
- The model itself is FOSS-Licensed & DFSG-compliant;
- trained from explicitly FOSS-licensed & DFSG-compliant datasets (e.g. for supervised or unsupervised learning) or simulators (e.g. for reinforcement learning), and the dataset is publicly available to anonymous users;
- corresponding training program is present and complete;
While there are a few OSI Approved licenses that are unacceptable under Debian's DFSG (OSL, RPL, CPAL, etc.), the fact that most models blessed as kosher by the OSAID today would be permanently ineligible for inclusion in Debian should be a five-alarm fire for those pushing it. There's good reason why Debian won't distribute AI models any time soon, but that's no excuse to lower the bar.
OpenAI on Open AI
Being one to eat my own dogfood, I had a chat with the cloud brain (ChatGPT), asking it:
What is the equivalent of the "preferred form" for open source software (i.e., source code, as in "the source code must be the preferred form in which a programmer would modify the program") in the context of Artificial Intelligence?
This is what it had to say:
In summary, the "preferred form" of an AI system for modification would include:
- The source code for the model architecture and training process.
- The raw or cleaned training data, or at least a detailed description of it.
- The model's learned weights or checkpoints.
- The configuration settings for training, including hyperparameters.
- Scripts for training, evaluating, and fine-tuning the model.
- Documentation that contextualizes the model's function, performance, and limitations.
All these components together form the most modifiable and transparent version of an AI system, analogous to source code in traditional software.
Addressing training data specifically, it had this to say (closing the loophole it opened in the bullet point above):
Training Data: In many AI systems, the model is only as good as the data it was trained on. Thus, providing access to the "preferred form" would mean giving access to the training data in its raw or cleaned form, as well as details about any preprocessing steps. Transparency about data sources and curation is critical for reproducibility and auditability.
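To sketch what that buys you in practice (the pipeline, field names, and file names below are hypothetical, not drawn from any actual model release): publishing the raw data alongside the cleaning script lets anyone regenerate and audit the exact training set, whereas shipping a cleaned blob, or a prose description of one, does not.

```python
import hashlib
import json

# Hypothetical preprocessing pipeline, published alongside the raw data so
# the cleaned training set can be regenerated, audited, and modified.
def preprocess(records):
    cleaned = []
    for r in records:
        text = r["text"].strip().lower()       # normalisation, made explicit
        if len(text) < 10:                     # drop near-empty records
            continue
        if r.get("license") != "CC-BY-4.0":    # keep only freely licensed rows
            continue
        cleaned.append({"text": text, "source": r["source"]})
    return cleaned

raw = json.load(open("raw_corpus.json"))       # hypothetical raw dataset
cleaned = preprocess(raw)

# Provenance: a digest of the exact training set used, for reproducibility.
digest = hashlib.sha256(json.dumps(cleaned, sort_keys=True).encode()).hexdigest()
print(f"training set sha256: {digest}")
```

Every filtering decision above is visible and reversible; a "data information" document describing the same steps in prose would let you approximate the set, not recreate it.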
Preferred Form
If we focus on the Open Source Definition's critical "preferred form" term (which prevents folks from distributing source code in Japanese on punch cards, meeting the letter but not the spirit of the law), then it is a question for practitioners: if tasked with modifying a model, would the preferred form be the input to the training process (i.e., the data), or the output from it (i.e., the model)? As Stefano Zacchiroli stated:
The main argument for including data in the definition is that they are part of the preferred form for modification of an AI system. We can debate (ideally, together with the actors who are actually creating and modifying AI systems) whether it is true or not that training data is part of that "preferred form", but it is undeniable that if they are, then the definition of Open Source AI must include them.
Sure, you can make minor modifications with the model alone (e.g., fine-tuning), but for major modifications (e.g., removing, substituting, or supplementing parts of the corpus) you absolutely need the data. Any restriction on the four freedoms above must render the candidate inadmissible, and it is crystal clear that lack of data restricts the ability to study and modify the system, at the very least. I would further argue that it also restricts use (deploying black boxes of unknown and/or unverifiable provenance is unacceptable to many users), as well as the ability to share in the traditional sense of the term Open Source (i.e., with those same freedoms).
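Returning to the toy model from earlier (again purely illustrative), the asymmetry is easy to demonstrate: fine-tuning needs only the released weights plus a few new examples of your own, while corpus-level surgery is a dead end without the original data.

```python
import numpy as np

released = np.load("model.npz")  # all that an "open weights" release provides
w, b = float(released["w"]), float(released["b"])

# Minor modification (fine-tuning): possible with weights alone, nudging the
# model toward a handful of new examples.
X_new, y_new = np.array([4.0]), np.array([9.5])
for _ in range(200):
    pred = w * X_new + b
    w -= 0.01 * 2 * np.mean((pred - y_new) * X_new)
    b -= 0.01 * 2 * np.mean(pred - y_new)

# Major modification (e.g., excising a defamatory or mislicensed record):
# requires filtering the original corpus and retraining from scratch, which
# is impossible when X and y were never published.
# X_clean, y_clean = X[keep], y[keep]  # no data, no keep mask, no retrain
```

The commented-out last line is the whole argument: every corpus-level operation presupposes the corpus.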
Historical Mistake in the Making
Indeed, by cementing the status quo it is possible to do more harm than good, and I would cite the precautionary principle in saying that the onus is on those advocating this watered-down approach to address the risk it poses to the Open Source community and the fledgling ecosystem of truly Open Source models. Given the sustained, well-justified objections, I would argue that the third board approval criterion, that the definition be "Supported by diverse stakeholders: The definition needs to have approval by end users, developers, deployers and subjects of AI, globally", has not been met. No amount of "endorsers" will change that.
At the IETF, lack of disagreement is more important than agreement, and "coming to consensus is when everyone (including the person making the objection) comes to the conclusion that either the objections are valid, and therefore make a change to address the objection, or that the objection was not really a matter of importance, but merely a matter of taste."
To quote Mo Zhou:
If OSAID does not require original training dataset, I'd say it will become a historical mistake.
I agree, and whether or not you do too, you should join the discussion.