Draft v.0.0.7 of the Open Source AI Definition is available for comments

stefano · April 12, 2024, 4:10pm

A new draft is out for comments. The changelog:

Incorporating the comments to draft v.0.0.6 and results of the working group analysis
removed reference to “the public” in the 4 freedoms, left the object (users) implied
removed reference to ML systems following the text “Precondition to exercise these freedoms is to have access to the preferred form to make modifications to the system”
separated the ‘checklist’ and made it specific to ML systems, based on the Model Openness Framework
described in highly generic terms the conditions to access the model parameters

Known issue

The model parameters don’t have a clear specification of legal frameworks. The table says “Available under terms compatible with Open Source principles”, which is not ideal.

The lawyers will have to chime in to clarify whether we need to develop a checklist of criteria to evaluate legal documents specifically written to distribute model parameters, whether the OSD is enough, or whether we need something else entirely.

Next steps

This version is starting to get close to “feature complete”: there is a preamble, there is a definition and there is a checklist of default components, with the legal frameworks of acceptable terms of use, modification and distribution.

The next steps are:

review with legal experts at the Legal and Licensing Workshop next week
public review here

anon18632855 · April 13, 2024, 12:57am

Apologies for not contributing to the discussion earlier, but I’m reposting my comment here as I worry that while the definition (which should probably be labeled as such rather than “What is Open Source AI”) requires that users be able to “modify the system for any purpose” (which is implied in the DFSG and implemented in its terms), the checklist makes the requisite inputs for said modifications (i.e., training data) optional but “appreciated”.

The purpose of DFSG’s source code provision (“The program must include source code, and must allow distribution in source code as well as compiled form.”) is to enable users to modify behaviour and distribute the results in source (i.e., training data) and “compiled” (i.e., model weights) form.

It’s one thing to be able to deploy a model for inference — and indeed there’s little point in distributing one without permission to use it — and another altogether to have the freedom to change it, for example by transforming, reducing, or expanding the training data.

By making training, testing, and validation data sets optional but “appreciated”, this freedom is not protected; it’s the AI equivalent of freeware distributed without source code.

Granted, most models will not meet the definition, but most software is proprietary rather than open source. An example of a model that should meet the definition is one trained on Wikipedia, itself “available under open documentation license” (CC).

zack · April 13, 2024, 6:46am

I am not clear on what “Available under open documentation license” means in the new definition draft. Specifically, “open documentation license” is not defined anywhere, but should be.

This is in terms of “well-formedness” of the definition.

In terms of policy, I think all requirements “Available under open documentation license” should instead be “Available under OSI-compliant license”, because there is no reason to treat documentation any differently from code.

stefano · April 13, 2024, 8:31am

That’s a last-minute copy-paste fail, I apologize. A new meaning for the old rule “Never release on Fridays”. Sorry for the confusion, I’m fixing it and updating the version number to a patch release.

As I’m doing that I also noticed a minor issue: should OSI-compliant be OSD-compliant instead?

stefano · April 13, 2024, 8:46am

Welcome to the debate

The issue is not that “most models will not meet the definition” but that none will. Consider that the most open systems, like Pythia, currently cannot distribute their source data because of legal uncertainties, so they would technically fail such a requirement. Additionally, a whole class of ML systems would be excluded from ever being Open Source AI (federated learning and other privacy-preserving techniques). This thread has more arguments.

pchestek · April 13, 2024, 1:10pm

I would say “Available under terms that satisfy (or perhaps “meet”) the Open Source principles.” “Compatible” leaves wiggle room; things can be compatible with a standard without satisfying the standard.

As to a checklist, isn’t that what the OSAID is? There can be many ways that a document can be written to ensure that the rights are available (compare BSD to CAL). I would hesitate at this point to dictate any particular requirements for the documents, at least not until we have some examples of AI licenses that at least attempt to be open source.

zack · April 17, 2024, 8:13am

I’m not sure if you meant that in absolute terms, but for what it’s worth, this is not true.

For instance, in the domain of LLMs for Code, StarCoder2 is a state-of-the-art model whose training dataset is redistributable (and redistributed).

I’m less familiar with other application domains, but ML models trained only on data sources like Wikipedia, Wikidata, Wikimedia Commons, etc., could also easily redistribute their training datasets.

ydietrich · April 19, 2024, 7:53am

Stefano, I am new to the initiative but have been working on the legal status of model weights for a while. As with many other lawyers, it looks to me that copyright protection cannot be available there, as we are talking about mathematical calculations. This means that no protection is available other than confidentiality, and that disappears as soon as the weights are made public. So licensing them through any kind of copyright license will not be enforceable! Anybody can use them without any copyright restriction (assuming that they have been made public), and you cannot enforce provisions such as attribution, no warranty, or copyleft. There is a workaround: using contractual terms, so that the recipient of the model is contractually restricted. This will trigger a longer conversation, as such a contractual arrangement is a bit complex to structure in a way that makes sense for open source, but it avoids waiting for international discussion and harmonization about how copyright or other IP rights may cover weights (but again, given their mathematical nature, it will be very tricky to consider any intellectual property). Happy to help further with this.