The Open Source(ish) AI Definition (OSAID)

I have just shared my thoughts (minus some deep links that are available in the original but omitted here due to forum limits) on why I believe the OSI board must not approve the proposed OSAID next month, given the outstanding objections:

The Open Source Initiative (OSI) are seeking endorsements of their upcoming Open Source AI Definition (OSAID), which has worked its way through a series of drafts since the start of the year to land on v0.0.9. Any differences between this and RC 1 are likely to be minimal as the form notes they “will double-check with [endorsers] to confirm [their] continued endorsement of any changes from v.0.0.9 to RC1”. Do not endorse this draft.

While the horse is being loaded into the starting gate, it hasn’t bolted yet, so it’s not too late to advocate for a change in direction. Tom Callaway agrees “we can always make the definition more permissive if we discover that we have been too ambitious with a definition, but that it is functionally impossible for us to be more ambitious later”.

Background

As a refresher, the OSI’s own Open Source Definition, which was originally derived from the Debian Free Software Guidelines (DFSG), requires (among other things) that “the source code must be the preferred form in which a programmer would modify the program”. Note that the Free Software Foundation’s own Free Software Definition predates the DFSG, though the DFSG was not derived from it. In a world where software was delivered in a compiled binary form which is infeasible to modify, Open Source gave users the blueprints to make changes as they saw fit, by way of OSI Approved licenses. This is good stuff and the OSI has played an important role in the industry to date.

Enter cloud computing almost 20 years ago now(!), and the transition from products to services that disrupted the IT industry (AI is set to disrupt every industry). For the Free and Open Source Software (FOSS) community this was a problem too because the “service provider loophole” meant that software was no longer being “conveyed” — a requirement to use it when it had to run on your machine — but merely executed on a server with controlled access provided. Attempts to address this by making viral licenses like the GNU General Public License (GPL) even more viral (e.g., Affero GPL aka AGPL) — triggering the requirement to redistribute source simply by accessing it over a network rather than distributing it — largely failed (thankfully).

That’s why I rounded up several of the early pioneers of cloud computing to form the Open Cloud Initiative (OCI), and through a consultative process, determined that “Open Cloud” would require open standard formats and interfaces; there’s no point having transparent access (i.e., APIs) to opaque data (i.e., undocumented, obfuscated, or even encrypted formats), nor in having transparent data (i.e., open standard formats) without programmatic access. Related definitions were included for “Free Cloud” and “Open Source Cloud”, which we see a demand for again today with AI. The OSI declined my offer to take on this challenge at the time, but I understand they may yet — hopefully they get it right when and if they do. This context is useful for understanding how we got to where we are.

Open Source AI

Artificial Intelligence (AI), on the other hand, is something the OSI determined needed to be addressed this year, setting out to do so with the first town hall meeting on 12 January 2024, based on early drafts from a private workgroup. From the slides, they rightly started by asking “What is the preferred form to make modifications to an AI system?”, noting that:

To be Open Source, an AI system needs to be available under legal terms that grant the freedoms to:

  • Use the system for any purpose and without having to ask for permission.
  • Study how the system works and inspect its components.
  • Modify the system to change its recommendations, predictions or decisions to adapt to your needs.
  • Share the system with or without modifications, for any purpose.

So far, so good. They also rightly define the components of an AI system as:

  • Code: Instructions for a computer to complete a task.
  • Model: Abstracted representation of what an AI system has learned from the training data.
  • Data: Information converted into a form efficient for processing and transfer.

So, in order to protect the four freedoms, we just need to make the three components available under approved licenses, right? Apparently not.

Two-Legged Stool

By draft 0.0.6, the third leg of the stool, data, had been cut off and deemed “not required” but rather merely “appreciated” — a pointless observation for a litmus test. In draft 0.0.7, the data deficiency was acknowledged by way of rebranding it to “data transparency”, which was when I waded into the discussion:

I worry that while the definition (which should probably be labeled as such rather than “What is Open Source AI”) requires that users be able to “modify the system for any purpose” (which is implied in the DFSG and implemented in its terms), the checklist makes the requisite inputs for said modifications (i.e., training data) optional but “appreciated”.

OSI’s Executive Director, Stefano Maffulli replied, referring me to an earlier thread from January on the elephant in the room and claiming that “the issue is not that ‘most models will not meet the definition’ but that none will do.” Richard Fontana concurred, adding that “as a practical matter there will be few, if any, open source models other than perhaps some toy examples or research models of limited practical interest and utility”. This point is irrelevant though, as either the definition meets the standard set by the community long ago (whether or not any existing implementations are compliant out of the gate), or it does not, and I would join others in arguing that the proposed definition does not.

Existing Open Source AI

Stefano Zacchiroli agreed in the earlier thread, stating that “the ability to redistribute the original training dataset should be a requirement for the freedom to modify an ML-based AI system, because short of that the amount of modifications one can do to such a system is much more limited.” He beat me to pointing out in the later thread that appropriately licensed training data sets like The Stack v2 (and Wikipedia, Wikidata, Wikimedia Commons, etc.) do exist and are already being used to create what I/we would consider to be true “Open Source AI” like StarCoder2. The “truly open” (their words, not mine) Open Language Model (OLMo) was also cited, demonstrating that truly Open Source AI is an achievable aim, provided we don’t poison the well by lowering the well-established standards (irrespective of noted existing ab/use of the term “Open Source” in the machine learning community, which is also irrelevant). This “prior art” meets the board approval criterion that it “Provides real-life examples: The definition must include relevant examples of AI systems that comply with it at the time of approval, so cannot have an empty set.”

Neither thread came to even a rough consensus, with folks on both sides of the debate maintaining their strongly-held positions. Various issues were raised, including:

  • not being able to fix models trained on defamatory data;
  • models being derivative works of training data and thus inheriting the license, or alternatively being mathematical calculations unprotectable by copyright;
  • the practicality of hosting large data sets and the risk of third-party hosted data sets disappearing;
  • the EU’s AI law exclusion for “open weights” only models (which doesn’t matter);
  • whether we are weighing reproducibility too heavily or not enough (if not for reproducibility of binaries, what is the point of source code?), and whether byte-for-byte reproducibility is an achievable or even useful goal;
  • the issue of open access but copyrighted data, including Flickr images and ImageNet; and
  • various other interesting but often irrelevant arguments (before devolving into a debate about OSI governance and the history of the OSD).

Weasel Words

By draft 0.0.8 the weasel word “information” was introduced, defining the contrived term “data information” (otherwise known as metadata, as in “data about data”) to be “sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data”. The excuse given on behalf of these non-free models is that the training data is inconvenient to release or unable to be released for whatever reason, typically because it’s subject to proprietary copyrights (e.g., New York Times articles, YouTube transcripts), or because it contains personally identifiable information (e.g., healthcare records). We owe it to the community to meet our own well-established standards precisely because doing so encourages the creation and curation of free data sets rather than undermining them.
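As a purely hypothetical illustration (mine, not the OSI’s), the gap between “data information” and the data itself might look something like the following sketch; every value is made up.

```python
# Hypothetical "data information" (i.e., metadata) for a model; all values are invented.
data_information = {
    "sources": ["web crawl", "licensed news archive", "forum dumps"],
    "size_tokens": 2_000_000_000_000,
    "preprocessing": "deduplicated, language-filtered, PII-scrubbed",
    "metadata_license": "CC-BY-4.0",  # licenses the description, not the data
}

# The data itself: without this, none of the above lets you remove,
# substitute, or supplement parts of the corpus.
training_data = None  # not released
```

A skilled person handed only the dictionary above can describe the system, but cannot rebuild or meaningfully modify it.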

In the current draft 0.0.9 the additional requirement that “Data information shall be made available with licenses that comply with the Open Source Definition” was also introduced, but it is too little, too late; metadata licensing is meaningless if the data itself is unavailable. A redundant and confusing attempt to disambiguate AI models (including weights) from AI weights was also added, despite both code and weights being covered earlier. It is clear to me and many others that this version, which is set to become the release candidate and ultimately the 1.0 release at All Things Open on 28 October 2024, remains deficient and should not be adopted. It’s like the process — not helped by the curious board approval criterion that it be “Ready by October 2024: A usable version of the definition needs to be ready for approval by the board at the October board meeting” rather than when it’s ready — went something like:

  1. We have to do something.
  2. This is something.
  3. We have to do this.

Why not both?

I would be remiss at this point not to acknowledge the carefully-considered compromise offered by Karsten Wade, a fellow maintainer of the Personal Artificial Intelligence Operating System (pAI-OS) — hence my interest in the topic — and a fellow member of the open source old guard, who presented a detailed proposal to satisfy both camps by bifurcating Open Source AI models into the default, which is required to release the data (“D+”), and those with an “integrity exception” (“D-”). As discussed in the earlier thread, there is precedent for this dating back to the 1991 GPLv2 system library exemption for proprietary operating system components, but that was when it was not going to be possible to release an Open Source operating system without them, which is no longer the case and was never the case here. A related exception was the creation of the Lesser GPL (LGPL), with which an author can allow their code to be linked with other non-free software.

While this proposal has some traction, another opponent of the proposed definition, Steven Pousty, pointed out that the “OSI has made no such distinction with software so why should they with AI/ML.” I also prefer to keep solutions as simple as possible (but no simpler), and in any case the OSAID needs to function as a binary litmus test like the Open Source Definition (OSD): either you’re in (i.e., “D+”) or you’re out. Fortunately, there’s a perfectly good existing term that describes “D-”: Open Weights.

Toxic Candy

We owe nothing to those who don’t commit to the cause and need not pander to the wants of their business models, but one could still make the case for a pejoratively-named “Lesser” license for those who don’t or can’t release training data. Mo Zhou, a fellow Debian developer who drafted Debian’s ML Policy and authored the earlier thread, has a special name for these appealing yet problematic models: Toxic Candy.

Toxic Candy Model is a model satisfying ALL the following conditions:

  1. The model itself is FOSS-Licensed & DFSG-compliant;
  2. trained from unknown, private, or non-free datasets or simulators;
  3. corresponding training program or script is present and complete;

It’s worth referencing the definition of a free (as in freedom, not as in beer) model, as Debian appear to have already got it right half a decade ago:

Free Model is a model satisfying ALL the following conditions:

  1. The model itself is FOSS-Licensed & DFSG-compliant;
  2. trained from explicitly FOSS-licensed & DFSG-compliant datasets (e.g. for supervised or unsupervised learning) or simulators (e.g. for reinforcement learning), and the dataset is publicly available to anonymous users;
  3. corresponding training program is present and complete;

While there are a few OSI Approved licenses that are unacceptable under Debian’s DFSG (OSL, RPL, CPAL, etc.), the fact that most models blessed as kosher by the OSAID today would be permanently ineligible for inclusion in Debian should be a five-alarm fire for those pushing it. There’s good reason why Debian won’t distribute AI models any time soon, but that’s no excuse to lower the bar.

OpenAI on Open AI

Being one to eat my own dogfood, I had a chat with the cloud brain (ChatGPT), asking it:

What is the equivalent of the “preferred form” for open source software (i.e., source code, as in “the source code must be the preferred form in which a programmer would modify the program”) in the context of Artificial Intelligence?

This is what it had to say:

In summary, the “preferred form” of an AI system for modification would include:

  • The source code for the model architecture and training process.
  • The raw or cleaned training data, or at least a detailed description of it.
  • The model’s learned weights or checkpoints.
  • The configuration settings for training, including hyperparameters.
  • Scripts for training, evaluating, and fine-tuning the model.
  • Documentation that contextualizes the model’s function, performance, and limitations.

All these components together form the most modifiable and transparent version of an AI system, analogous to source code in traditional software.

Addressing training data specifically, it had this to say (closing the loophole it opened in the bullet point above):

Training Data: In many AI systems, the model is only as good as the data it was trained on. Thus, providing access to the “preferred form” would mean giving access to the training data in its raw or cleaned form, as well as details about any preprocessing steps. Transparency about data sources and curation is critical for reproducibility and auditability.

Preferred Form

If we focus on the Open Source Definition’s critical “preferred form” term — which prevents folks from distributing source code in Japanese on punch cards to meet the letter but not the spirit of the law — then it is a question for practitioners: if tasked with modifying a model, would the preferred form be the input to the training process (i.e., the data), or the output from it (i.e., the model)? As Stefano Zacchiroli stated:

The main argument for including data in the definition is that they are part of the preferred form for modification of an AI system. We can debate (ideally, together with the actors who are actually creating and modifying AI systems) whether it is true or not that training data is part of that “preferred form”, but it is undeniable that if they are, then the definition of Open Source AI must include them.

Sure, you can make minor modifications with the model only (e.g., fine tuning), but for major modifications (e.g., removing, substituting, or supplementing parts of the corpus) you absolutely need the data. Any restriction on the four freedoms above must render the candidate inadmissible, and it is crystal clear that lack of data restricts the ability to study and modify the system at least. I would further argue that it restricts its use — as deploying black boxes of unknown and/or unverifiable provenance is unacceptable to many users — as well as the ability to share in the traditional sense of the term Open Source (i.e., with those same freedoms).
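To make the asymmetry concrete, here is a minimal sketch of my own (not anything from the OSAID drafts): fine-tuning needs only the released weights, whereas corpus-level modifications need the original training data and the training code. The “example-org/…” identifiers and the “tags” field are hypothetical placeholders.

```python
# Hypothetical sketch of what you can and cannot do without the training data.
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Minor modification (possible with weights alone): fine-tune the released
# checkpoint on your own data, nudging existing behaviour.
model = AutoModelForCausalLM.from_pretrained("example-org/released-weights")
tokenizer = AutoTokenizer.from_pretrained("example-org/released-weights")
# ... standard fine-tuning loop over your own small dataset ...

# Major modification (impossible without the data): remove a problematic
# slice of the corpus and retrain. This only works if the original training
# corpus was actually published.
corpus = load_dataset("example-org/original-training-corpus")
filtered = corpus.filter(lambda example: "defamatory" not in example["tags"])
# retrain_from_scratch(config, filtered)  # also requires the full training code
```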

Historical Mistake in the Making

Indeed, by cementing the status quo it is possible to do more harm than good, and I would cite the precautionary principle in saying that the onus is on those advocating this watered-down approach to address the risk to the Open Source community and the fledgling ecosystem of truly Open Source models. Given the sustained, well-justified objections, I would argue that the third board approval criterion — that it be “Supported by diverse stakeholders: The definition needs to have approval by end users, developers, deployers and subjects of AI, globally” — has not been met. No amount of “endorsers” will change that.

At the IETF, lack of disagreement is more important than agreement, and “coming to consensus is when everyone (including the person making the objection) comes to the conclusion that either the objections are valid, and therefore make a change to address the objection, or that the objection was not really a matter of importance, but merely a matter of taste.”

To quote Mo Zhou:

If OSAID does not require original training dataset, I’d say it will become a historical mistake.

I agree, and whether or not you do too, you should join the discussion.


I believe the decision to release OSAID 1.0 is up to the OSI Board of Directors, and RC1, which is the preceding stage, has not yet been published. Wouldn’t it be better to reconsider once RC1 is released? At the very least, there have been several discussions about the handling of datasets, which you are concerned about, so I suspect some of these will be reflected in the RC1 version.

That aside, your suggestion brings up an interesting point about Debian ML policy (as a longtime Debian fan since the '90s, I’m naturally pleased to see Debian mentioned).

Just to confirm, you’ve linked to a version of ML-Policy.rst from five years ago—is this intentional? The “Sourceless Model” defined in that five-year-old version clearly accepts non-free datasets, yet it is still allowed in the main section packages. In other words, AI models that Debian accepts would not be recognized under OSAID.

Sourceless Model satisfies ALL the following conditions:
(1) FOSS-licensed & DFSG-compliant;
(2) trained from unknown, private, or non-free datasets or environment simulators;
(3) training program or script is present;
(4) affects no critical decision of any program, and doesn’t take part in the learning process of any other machine learning model;

For the sake of Debian and your reputation, I’ve shared the current version of Debian’s ML-Policy.rst below:

It seems that Debian no longer uses the “Sourceless Model” category, and these have been incorporated into ToxicCandy. However, separate from these distinctions, the concept of “reproducibility for Serialized Model” has been introduced. I don’t see much difference between this “Reproducibility” and OSAID’s “recreate a substantially equivalent system using the same or similar data.” Also, the current Debian ML policy states that up to “Type-F reproducibility” can be included in the main section.

At present, there aren’t any major differences between the Debian policy (I assume this ML policy is unofficial?) and OSAID, but it’s true that Debian’s policy is more detailed. It may be worth studying their policy a bit more.

What do you think, @zack -san?

I’m not expecting more than the shifting of deck chairs between 0.0.9 and RC1, and I expect the endorsers will be announced at the same time — a pointless metric if there ever was one.

Indeed, I imagine @Mer is already putting the finishing touches on her “Defining Open Source AI” presentation at Nerdearla, 17:15-17:50 this Thursday, and I’m just hoping we can decide instead to measure twice and cut once on this, given the damage we’re about to do to our cause. It’s not yet too late for us to examine the specific issue of data more closely, with a view to producing a higher-quality deliverable.

As developers of an Open Source AI operating system, we are greatly concerned that we’re going to be lumped in the same bucket as “toxic candy” that does nothing to protect the 4 freedoms, which is why lining up this own goal constitutes a hair-on-fire emergency for us (and should be for you too).

Yes, I deliberately deep-linked to the 5-year-old revision to show that one guy managed to achieve in his definition of Free Models what has escaped us after 17 town halls and who knows how many other meetings. I also linked to the current version, which eliminates the “Sourceless” model, analogous to the “D-” quadrant @quaid proposed in another thread.

Yes, it’s unofficial, but it reflects the reality that Debian won’t distribute AI models any time soon. Which is fine, and better than compromising on our principles (i.e., the DFSG).

OSAID in its current form is more analogous to the Toxic Candy models “trained from unknown, private, or non-free datasets or simulators”, as it does not require data. For many/most models today this is not entirely different from distributing a recipe that requires unicorn eggs, and about as useful too from an Open Source perspective.

If an “unknown” or “private” dataset is used, it is naturally impossible to reproduce the model. In other words, systems that use such datasets do not comply with OSAID.

The issue we are grappling with in the models categorized as Toxic Candy by the Debian ML team may be related to the “non-free datasets” part. No, perhaps the real issue is, “What exactly is a free dataset?” Unlike source code, datasets contain various types of content, and rights beyond copyright arise, which are handled differently depending on the country. Even if a dataset is licensed under an OSD-compliant or DFSG-free license, it cannot be freely distributed if it includes content subject to third-party privacy or publicity rights. However, even in such cases, it is legal in many jurisdictions to use such datasets for machine learning, and in many cases datasets can be accessed by anyone for machine learning purposes. Based on these cases, I believe the current discussion is about trying to draw a reasonable line.

Two years ago, the Debian project accepted non-free firmware, even amending their Social Contract to do so. Some may argue that this move allowed Debian to tolerate Toxic Candy, but I believe it was a reasonable decision made in light of modern usage needs (though personally, I felt quite sad about it).

The difference between the firmware discussion and the OSAID discussion is that we require models to be reproducible. This means that those who want to replicate the model must also be able to reproduce the dataset. If anyone can recreate the dataset used for machine learning purposes, wouldn’t that be considered free?

Well, in fact they do comply with OSAID 0.0.9.

All developers have to do to comply is to provide plausible “Data Information” pretending they trained the system on a publicly available dataset.

Such a “facade dataset” might even be a proper subset of the dataset actually used, but without access to the whole dataset used in training, nobody could really study or modify that AI system.

Well, when the dataset cannot be distributed but is available for anyone to use in machine learning under the same legal terms that allowed its use during the system’s training, the system can still be considered Open Source AI, since both the freedom to study and the freedom to modify are preserved.

But being available is different from being made available.
Without requiring distribution of the dataset, all legal issues disappear.

What do you think about such a solution?

Sure, if anyone can recreate the whole and exact dataset used during the training of a system, such a system could be defined as Open Source AI.

If it’s just about providing plausible data information, wouldn’t that mean they “can falsely claim compliance” rather than actually being “compliant”? Yes, there are indeed several ambiguous areas in the current OSAID that allow for such misrepresentation, and I believe many people are aware of this fact.

That being said, while I think some of the terms in the current OSAID could be refined, I also believe that the definition should retain a certain degree of ambiguity. The “Open Source Definition” we advocate for also leaves room for interpretation when viewed from a legal perspective. What fills in these interpretive gaps is the license review process, where long-standing community culture and legal consistency are taken into account to reach reasonable interpretations. I believe it is through the accumulation of these reviews that OSI has built its trust.

Some view the release of OSAID 1.0 as a potential threat to OSI’s credibility, but I believe the approval process is more important than the definition itself. Even with current license reviews for source code, we often see people confidently asking OSI to approve their “groundbreaking license” as open source. However, such licenses are rarely approved. Ultimately, the key issue for OSAID will be how to build a strict and sustainable approval process.

That said, it’s still unclear what exactly OSI will review based on OSAID and how they will do it. This concerns me. I understand that the focus will be on reviewing licenses and legal conditions, not the AI systems themselves, but I’m still unsure whether the current review process will remain as is or whether a new process will be created. As a volunteer collaborator, I don’t have clarity on this yet.

To be brief: I posted the “Training data access” issue in the first place, and I’m the author of Debian’s ML-Policy. “ToxicCandy” is the most important concept introduced in that document, and I still stand by it. ML-Policy is an unofficial document, created after a lengthy discussion on the debian-devel mailing list.
I’m unwilling to endorse the current state of the OSAID since my concerns are still there.


If one day the Debian community were insane enough to allow ToxicCandy, I’d quit in protest.
Firmware is something my computer does not work without (like iwlwifi), so I’m forced to use it, or I’d better buy the same hardware as RMS.
A ToxicCandy model does not block me from using my computer.


Thank you, Sam. You have made a very nice and impressive summary from lots of information sources, including some details that I had almost forgotten even though I wrote them myself. :laughing:


Thanks for this thread @samj, and thanks @lumin for the priceless Toxic Candy term. This covers exactly what I’ve been thinking about Llama. In their Llama 2 “technical report” they mention the existence of an RLHF fine-tuning dataset they refer to as the Meta reward modeling data.

This is the same company that had no qualms about running the most massive social experiment in the history of social media on their own users, without informed consent (the resulting paper drew an editorial Expression of Concern for this reason). We have no way of checking what’s in there, and more than enough reasons to suspect that it’s more than just guardrails to make the thing sound all nice and docile. Why wouldn’t they also inject a bit of positivity about their business partners, or any of a number of other possible social engineering backdoors?

You may think one has to be paranoid to suspect this, but Facebook really only has to live up to its reputation here. And most importantly, without training data transparency, we’ll never know.


Why do you cite Llama in this context? That system fails at every level of the Open Source AI Definition in its current form, from version 1 to 3:

  • Data Information: not sufficiently detailed
  • Code: no training code
  • Parameters: not available under OSI approvable terms.

All versions of Llama fail at all three points. Why do you bring it up?

I bring it up because I love the Toxic Candy metaphor, as I said. I am aware that under the current OSI definition it is not seen as open source, although as we all know Zuckerberg keeps co-opting that term.
