The Open Source AI Definition v.1.0-RC1 is available for comments

Originally published at: The Open Source AI Definition RC1 is available for comments – Open Source Initiative

A little over a month after v.0.0.9, we have a Release Candidate version of the Open Source AI Definition. This milestone was reached with extensive community feedback: five town hall meetings, numerous comments on the forum and on the draft, and in-person conversations at events in Europe, China, India, Senegal, and Argentina.

There are three relevant changes to the part of the definition pertaining to the “preferred form to make modifications to a machine learning system.”

The feature that will draw the most attention is the new language of Data Information. It clarifies that all the training data needs to be shared and disclosed. The updated text comes from many conversations with several individuals who engaged passionately with the design process, on the forum, in person, and on hackmd. These conversations helped describe four types of data: open, public, obtainable, and unshareable, each well described in the FAQ. The legal requirements differ for each, but all are required to be shared in the form that the law allows them to be shared.

Two new features are equally important. RC1 clarifies that Code must be complete: enough for downstream recipients to understand how the training was done. This was done to reinforce the importance of training, for transparency, security, and other practical reasons. Training is where innovation is happening at the moment, which is why you don’t see corporations releasing their training and data-processing code. We believe, given the current state of knowledge and practice, that this is required to meaningfully fork (study and modify) AI systems.

Lastly, there is new text meant to explicitly acknowledge that it is admissible to require copyleft-like terms for any of the Code, Data Information, and Parameters, individually or as bundled combinations. An illustrative scenario is a consortium that owns rights to training code and a dataset deciding to distribute the code+data bundle under legal terms that tie the two together, with copyleft-like provisions. This sort of legal document doesn’t exist yet, but the scenario is plausible enough that it deserves consideration. This is another area that OSI will monitor carefully as we start reviewing these legal terms with the community.

A note about science and reproducibility

The aim of Open Source is not and has never been to enable reproducible software. The same is true for Open Source AI: reproducibility of AI science is not the objective. Open Source’s role is merely not to be an impediment to reproducibility. In other words, one can always add more requirements on top of Open Source, just like the Reproducible Builds effort does.

Open Source means giving anyone the ability to meaningfully “fork” (study and modify) a system, without requiring additional permissions, to make it more useful for themselves and also for everyone. This is why OSD #2 requires that the “source code” be provided in the preferred form for making modifications. This way everyone has the same rights and ability to improve the system as the original developers, starting a virtuous cycle of innovation. Forking in the machine learning context has the same meaning as with software: having the ability and the rights to build a system that behaves differently from the original. Among the things a fork may achieve are fixing security issues, improving behavior, and removing bias. All of these are possible thanks to the requirements of the Open Source AI Definition.

What’s coming next

With the release candidate cycle starting today, the drafting process shifts focus: no new features, only bug fixes. We’ll watch for newly raised issues, looking for major flaws that might require significant rewrites to the text. The main focus will be on the accompanying documentation, the Checklist and the FAQ. We also realized that, in our zeal to solve the problem of data that needs to be provided but cannot be supplied by the model owner for good reasons, we had failed to make clear the basic requirement that “if you can share the data you must.” We have already made adjustments in RC1 and will be seeking views on how to better express this in an RC2.

In the weeks leading up to the 1.0 release on October 28, we’ll focus on:

  • Getting more endorsers of the Definition
  • Continuing to collect feedback on hackmd and the forum, focusing on new, previously unseen concerns
  • Preparing the artifacts necessary for the launch at All Things Open
  • Iterating on the Checklist and FAQ, preparing them for deployment

Full changelog

  • Rewrote Data Information for clarity, based on feedback from events, town halls, the forum, and hackmd
  • Cleaned up the Definitions section, removing the decorations; also moved the section to the end of the document to give focus to the main content
  • Incorporated a comment from Yann on hackmd, adding “at least” to the benefits of Open Source AI
  • Clarified that Code must be complete: enough to understand how the training was done
  • Renamed Weights to Parameters (as it was called in previous versions), based on public feedback
  • Added a definition of machine learning from the OECD
  • Made some more cosmetic changes based on feedback on hackmd

Link to the Open Source AI Definition – 1.0-RC1

1 Like

You’ve heard a lot from me on this already, so I asked gpt-o1 to use reasoning to critique the release candidate from a practical perspective with respect to the four freedoms (which is my primary concern). It agrees the proposed definition still fails to protect two or three of the four freedoms.

I also ran a small straw poll of AI practitioners (n=15) asking “What is the preferred form in which a practitioner would modify a model?”. Fully 87% chose Data (i.e., training datasets) over Model (i.e., weights & biases). We could run a larger and more controlled survey, but this is not even close. Let’s avoid contentious claims to the contrary when this is clearly a compromise, resulting in a compromised definition.

I do appreciate the efforts to incorporate feedback right up until the last minute, and I see you’ve switched to “no new features, only bug fixes”. Hopefully there will be an acknowledgement at launch that this is a living document, as it took until v1.9 to get the OSD mostly right. There should be a concerted effort to tighten it up and fully protect the four freedoms in the future.

Assessment of the OSAID’s Protection of the Four Freedoms from a Practical Perspective

While the OSAID aims to protect the four essential freedoms of Free Software, a practical examination reveals limitations that may hinder a researcher’s ability to study an AI system and a practitioner’s capacity to modify it effectively.

  1. Freedom to Run the Program (Freedom 0):
  • Practical Perspective: Users can run the AI system without restrictions.
  • Critique: This freedom is upheld in practice, as the OSAID explicitly allows the use of the system for any purpose.
  2. Freedom to Study the Program (Freedom 1):
  • Practical Perspective: Without access to the actual training data, researchers face significant obstacles in thoroughly understanding the AI system.
    • Data Analysis Limitations: Key aspects like biases, data quality issues, and representativeness cannot be fully assessed without the raw data.
    • Transparency Concerns: Detailed Data Information may not substitute for direct data access, leaving gaps in understanding how the model processes inputs to produce outputs.
  • Critique: The OSAID permits withholding the training data, relying instead on “Data Information.” This undermines the practical ability to study the system comprehensively, as critical insights are often derived from examining the data itself.
  3. Freedom to Modify the Program (Freedom 2):
  • Practical Perspective: Modifying an AI system typically involves retraining or fine-tuning the model, which requires access to the training data (see the sketch after this list).
    • Retraining Challenges: Without the original data, practitioners cannot retrain the model to correct issues, adapt it to new domains, or improve its performance.
    • Limited Modifications: Modifications are constrained to code-level changes or adjustments to parameters, which may not significantly alter the model’s behavior.
  • Critique: The lack of access to training data severely limits the practitioner’s ability to modify the AI system meaningfully. The OSAID’s allowance for non-disclosure of data creates a barrier to exercising this freedom fully.
  4. Freedom to Share the Program (Freedom 3):
  • Practical Perspective: Users can share the AI system, but potential legal constraints may impede this freedom.
    • Licensing Ambiguities: The OSAID mentions that parameters “shall be made available under OSI-approved terms,” but does not clarify the legal status of model parameters.
    • Intellectual Property Issues: The possibility of patents or other legal instruments can restrict sharing, even if the OSAID’s licensing requirements are met.
  • Critique: While the OSAID supports sharing in principle, practical obstacles like intellectual property rights can limit this freedom. The definition does not fully address how such legal issues might affect the ability to share the AI system.
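
To make the Modify point concrete, here is a minimal sketch (in PyTorch; the model, dataset, and helper names are hypothetical stand-ins, not any real system’s artifacts) contrasting what a recipient can do with released parameters alone versus what requires the withheld training data:

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)

# Stand-in for a released artifact: parameters only, no training data.
model = nn.Linear(16, 2)

# (a) Fine-tuning works WITHOUT the original data: nudge the released
# parameters using our own (here random) examples.
own_x = torch.randn(32, 16)
own_y = torch.randint(0, 2, (32,))
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
loss = F.cross_entropy(model(own_x), own_y)
loss.backward()
opt.step()  # the model's behavior shifts, but only incrementally

# (b) Retraining from scratch -- e.g., to exclude a problematic subset --
# needs the original corpus, which the OSAID lets the owner withhold.
# The helpers below are hypothetical; a recipient cannot write them:
# corpus = load_original_corpus()                        # not available
# cleaned = [ex for ex in corpus if not problematic(ex)]
# model = train_from_scratch(cleaned)                    # freedom blocked
```

Under RC1, case (a) is always available to recipients, while case (b) depends on which of the four data classes the training data falls into.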

Overall Critique:

The OSAID’s provisions allow for the non-disclosure of training data and do not adequately prevent other legal mechanisms (like patents or trademarks) from restricting the use, study, modification, or sharing of AI systems. As a result, the four essential freedoms are only partially protected in practice. Researchers and practitioners may find themselves unable to exercise these freedoms fully due to practical limitations imposed by the definition’s allowances and omissions.

2 Likes

Whew, I finally managed to translate RC1 into Japanese, my native language. Now I can calmly and thoroughly review it.

There is a minor revision in the Preamble. In the sentence “For AI, society needs at least the same essential freedoms of Open Source,” the phrase “at least” has been added. It seems to suggest the possibility that AI may require freedoms beyond the four essential ones.
I’m not trying to express approval or disapproval, but I find it interesting that this change was made.

Overall, I feel it’s a positive shift. I’ll take a more thorough look at it after I wake up tomorrow morning.

PS: The ‘at least’ was mentioned in the changelog.

2 Likes

At least the scope is limited, not too free; I agree with this part.
But I still think that, like a coin with two sides, those who come to use the original invention may modify and reuse it with both good and bad intentions. I think it would be good if we could increase the clarity around the reuse of data to prevent security problems, and it would be great if we added some requirements to honor the creator of the original work. I may have digressed a bit, but among the purposes many people talk about, some people should consider security, like me. In the future, there may be both destroyers and defenders in the same place, something caused by small points that are overlooked.
Generally, I agree, although it does not contain all the elements in this category:
Fairness to all who create or use it
Trustworthiness and safety
Respect for privacy and security
Inclusiveness, and the promotion of inclusiveness and attraction
Transparency in creation and use
Accountability for consequences if problems arise in the future

In OSAID 0.0.9, the phrase “recreate a substantially equivalent system” was used in the description of data information, but in OSAID RC1, this expression was changed to “build a substantially equivalent system,” avoiding the use of the word “recreate.” I believe this is because the purpose of Open Source is not reproducibility. I think that reasoning is valid, but on the other hand, I feel that the scope of what constitutes a “substantially equivalent system” has slightly expanded. If we assume a scale from 0 to 10, with 0 being a system that is merely similar and 10 being an identical system, what was previously defined as an “8 or higher” now feels like it permits a range of around 6-7. Perhaps this concerns me because I’ve been scrutinizing it more closely while translating into Japanese.

However, unlike in 0.0.9, the addition of detailed descriptions of data information has made this point less of a concern. At this point, I feel that the explanation of data information is overall well-structured and now functions adequately as a definition.

One thing I am concerned about is the phrase “including for a fee” in point (3). I believe the provision is legitimate, but personally, I fear it could lead to unnecessary disputes when interpreted under Japanese law. This phrase likely refers to purchasing commercially available datasets. In Japan, Copyright Act Article 30-4, which broadly permits the use of copyrighted works for AI training without permission, is interpreted such that cases falling under “If the action would unreasonably prejudice the interests of the copyright owner” are generally limited to situations where the dataset is sold commercially. In other words, if someone purchases a paid dataset, that buyer can use it for AI training, but third parties are not granted the freedom provided under Article 30-4 and are effectively forced to purchase the dataset themselves. I don’t think this is a major issue, but I am concerned that the presence of the phrase “including for a fee” could spark sensitive discussions within Japan about the rights to use copyrighted works for AI training. However… after thinking about it for two days, I feel I may be overthinking this. But, I believe it would be better without the phrase…

Article 30-4 It is permissible to exploit a work, in any way and to the extent considered necessary, in any of the following cases, or in any other case in which it is not a person’s purpose to personally enjoy or cause another person to enjoy the thoughts or sentiments expressed in that work; provided, however, that this does not apply if the action would unreasonably prejudice the interests of the copyright owner in light of the nature or purpose of the work or the circumstances of its exploitation:
(ii) if it is done for use in data analysis (meaning the extraction, comparison, classification, or other statistical analysis of the constituent language, sounds, images, or other elemental data from a large number of works or a large volume of other such data; the same applies in Article 47-5, paragraph (1), item (ii));

Dear @anatta8538 ,

Regarding your remark

if we add some requirements to honor the creator of the original work

I do think that falls outside the objectives of the OSAID and the OSD, and would be better resolved under the license terms of a specific AI system.

As it is, that requirement would not respect the Free Software freedoms and would be better served by the philosophy behind the Ethical Stack.

1 Like

Added my comments to the OSAID and the FAQ.

On the OSAID

  • Clarify that the OSAID is like the OSD but applies the four freedoms to any and all elements of the AI system
  • Remove “preferred ways” - no longer needed if data is not excluded
  • Move “Definitions” to the FAQ, but change it in a way that clarifies that those definitions are stated in the primary sources indicated; we are not reinventing them

On the FAQ

  • Remove all sections explaining why data can be excluded as, per above definition, it no longer is
  • Move “skilled person” to definitions
  • Add more emphasis to the training code
  • Add a new question on why documentation is required

1 Like

You’re correct: Open Source is not an obstacle to the reproducibility of AI, just as it is not an obstacle to reproducible builds in software. The word “recreate” was confusing because the issue of reproducibility in ML is not even remotely solved (some argue it is not solvable for many modern deep learning systems; there is plenty of literature on the topic).

The scope remains the same though: Recipients of an Open Source AI must receive the ability (tools) and the rights (legal terms) to innovate on top of what they received, just like it happens with software.

The intention of that “for a fee” is to acknowledge that there are stewards who maintain good-quality datasets, and to recognize that these datasets can be part of the training of Open Source AI. Is this something you can imagine we can live with until we have more experience and are ready to change it in future versions?

A quarter of a century has passed since the creation of the OSD, but it is still only referenced by those in the software industry. However, there is an abnormal level of public interest in AI, and the situation is similar in Japan. Particularly concerning Japan’s copyright law, which broadly endorses machine learning, there has been tremendous opposition from the content industry. My concern was that if opponents of AI interpret this simplistically as “Open Source AI is being created from paid data,” it might further intensify their opposition to AI itself. Well… perhaps I’m overthinking it.

2 Likes

The Data Information section requires that all model input data is made available under an OSI license. However, point (1) seems to suggest that a model can still be considered open source even if it includes unsharable input data.

The intended interpretation here is unclear. My assumption is that unsharable or third-party data must be licensed by the data owner to the model developer under an OSI license but may still be subject to additional contract restrictions that prevent further sharing.

If the input data is OSI licensed but subject to additional contract constraints that restrict sharing, I believe this significantly undermines the freedoms to Share, Study, and Modify the model in its entirety.

Specifically:

  • Share: the definition includes input data (Data Information) as a component of an Open Source AI. If this input data cannot be shared, then can the resulting model possibly meet this definition?
  • Study: model users are limited in their ability to study the function and behavior of a model without access to the input data, as they are unable to characterize how different input data affects the resulting model.
  • Modify: while model users would be able to train the model to refine its behavior based on additional input data, they would not be able to modify the model by retraining it from scratch, for example by eliminating a portion of the training data that they do not want incorporated into the final model.

My suggestions are:

  1. Require that all Data Information used for training is both OSI licensed and freely sharable.
  2. State that providers can charge for delivering or hosting Data Information (same as for OSS), but such charges should not restrict the sharing of the data itself.
  3. Require disclosure of the specific OSI license(s) governing the input data to help users understand any implications for sharing, licensing, and uniqueness checks.

1 Like

You are right to be concerned about the “including for a fee” capitulation, but for the wrong reason.

That you can charge for source code has been an accepted tenet of Open Source — which is “free as in freedom, not as in beer” — but only because anyone can obtain and share the source under the same license if the software is conveyed to them. That is, the market price is kept at or near zero, because the cost of hosting software for download is close to zero (or borne by someone else; even the petabytes of Common Crawl are available for free on AWS)… you don’t even need to pay for the CD/DVD any more! While it sounds similar, this is not at all the same thing.

With datasets like the NYT archive estimated to cost in the tens to hundreds of millions of dollars ($50-200m), the training cost (including licensing) of GPT-5 said to be $2-2.5bn, and Anthropic already talking about $10bn training runs within two years, this concession risks restricting Open Source AI to the ranks of billionaires. As Japan’s own NII (which “keeps advancing the research and development of open generative AI” with LLM-jp) has shown in the LLM space, and Molmo has shown in the VLM space, among others, this is not a problem… unless we violate our core beliefs to certify closed models as Open Source AI.

Doing so would blatantly violate one of our community’s core principles—that we ‘must not discriminate against any person or group of persons’—by effectively discriminating against almost all Open Source users!

@grugnog (welcome!) asks that “input” data (which is not the same as “Data Information”: weasel words meaning metadata only) be made available under OSI licenses, but that’s a bridge too far in the other direction, one that would also have chilling effects on Open Source AI adoption, not to mention the board’s “cannot have an empty set” criterion. We do need to compromise, but we’ve chosen the wrong one and the baby is going out with the bathwater. By opting for Open Access rather than Open Licenses for the data components, we are not demanding vendors do the impossible (i.e., distribute unlicensed content), nor do we need to!

Open access to data (e.g., Common Crawl) appears to be the only workable compromise that demonstrably protects the four essential freedoms, and it’s already widely accepted in the industry, including many/most of the 1,000,000 free public models on Hugging Face, as well as being reflected in the oft-cited Model Openness Framework (MOF)'s “any license or unlicensed” wording.

The new FAQ introduces four classes of data, only one of which (the first) has been traditionally acceptable in the context of Open Source, the second being acceptable as a compromise to protect the four freedoms, and the latter two being unacceptable:

  1. Open training data: Acceptable to Open Source and Open Data communities.
  2. Public training data: Acceptable as a compromise to protect the four freedoms.
  3. Obtainable training data: Unacceptable due to obvious violation of non-discrimination clauses.
  4. Unshareable non-public training data: Unacceptable due to the violations of any/all of the four essential freedoms.

The charitable explanation for those claiming datasets made available “including for a fee” are acceptable is that they are confused about the historical context, but that is not the only explanation. The closer we get to the Kamikaze launch (to borrow another Japanese reference) at 12:45pm on 28 October, despite sustained, well-reasoned counter-arguments from many of Open Source’s old guard, the more I’m inclined to come around to @Shamar’s perspective that these discussions are just a front for decisions made in advance/behind closed doors, analogous to the smoke-filled back rooms of cloud (and other) standardisation efforts.

Per @thesteve0’s challenge, you’re welcome to prove me wrong, not only on the rights to inspect, modify, and recreate models, and the field-of-use restriction, but also on the “implications of not labeling a model Open Source” — we don’t owe anything to anyone but ourselves:

The Open Source label is a restrictive condition that not everyone wants or should want. Weakening the meaning of Open Source is a mistaken means for hoping organizations will be “more Open Source”.

Counterproof needed to falsify my position.

Demonstrate how not allowing a model to be called Open Source prevents the owner from sharing their model weights.

1 Like

Thank you for the welcome!

The clarification on the meaning of “Data Information” in this context is helpful - that seems like a very confusing term! Especially given the widely used “information is data with meaning” definition, my intuitive sense on first reading (as a native English speaker) was that it was trying to include both raw data and manually curated information. Not sure what a better term would be - perhaps “Data sources and methodology”?

While I would still prefer openly licensed training data (which is surely the “source” of the model weights), I am more concerned about the inclusion of the latter two classes of data than I am about public data, as they have a much more substantive negative impact on the freedoms to Share, Study, and Modify the model in its entirety.

I do get the need to build AI models that operate on private/personal/health data. I think the releasable components of these can be released under normal OSI licenses though, as a framework that others can use to build an equivalent model that operates on their own data. This still has significant value and should be encouraged, but I don’t think this is equivalent to an open source AI.

3 Likes

If we don’t understand it ourselves here, what hope does anyone else have? This wordsmithing is attempting to square the circle, but terms like “skilled person” appear to be amateur lawyering borrowing from patent, contract, and/or tort law where they have a specific legal meaning (IANAL either, by the way).

Absent “judges” to “execute” this “code” and “police” to “enforce” those decisions under the threat of violence, they serve no purpose here.

Surely, because if not, then what? Unobtainable inputs (except for million- and billionaires) are obviously absolute kryptonite for openness, and it boggles the mind that intelligent people are saying with a straight face that either of the latter categories is even a candidate for inclusion. I may not be right about this, but I’m definitely not wrong.

The open definition most succinctly requires:

“Open data and content can be freely used, modified, and shared by anyone for any purpose”

Open Source additionally explicitly protects the freedom to study, but that’s implicit here. The Free Software definition actually merges study and modify because you can’t do the latter without the former (which is also why any and all documentation is optional):

The freedom to study how the program works, and change it so it does your computing as you wish (freedom 1). Access to the source code is a precondition for this.

They also deviate from “preferred” form because that too is surprisingly subjective, opting instead for whatever the developer actually changes:

Source code is defined as the preferred form of the program for making changes in. Thus, whatever form a developer changes to develop the program is the source code of that developer’s version.

The way to test this is to look at what the original developers actually changed to develop the system, or have developers make actual modifications/improvements to an existing system and look over their shoulders to see what inputs they used. Eliminating “preferred” in favour of “actual” would sharpen up this discussion significantly.

Improvements are subjective too, and developers must be allowed to make whatever change they want, in no way restricted to, for example, what is possible by fine-tuning model weights. If I want to make my self-driving car’s ASS feature fart on failure — that’s actually a thing a developer implemented in production code that made it into my vehicle last week — then that’s on me:

Whether a change constitutes an improvement is a subjective matter. If your right to modify a program is limited, in substance, to changes that someone else considers an improvement, that program is not free.

For these and other reasons, the first release candidate remains a field of red flags.

4 Likes

Regardless of where you stand on data openness and accessibility, we have consensus around the OSAI definition with two data classes supported: Open Data and Open Access.

Let’s focus on where we have consensus for the 1.0 release, and continue working on the thorny issues afterward without the pressure and risk.

Here is my proposal thread, which includes an RC2-STRAW draft for discussion:

1 Like

Thank you for the welcome
I’ve reviewed the text of the RC1 version and compared it with version 0.0.9 and have a few questions:
1. In the RC1 version, the licensing requirement for Data Information and Parameters is stated as “OSI-approved terms,” which is consistent with the requirement for Parameters in version 0.0.9. However, in the 0.0.9 checklist, the requirement for licensing Parameters is described as “OSD-conformant terms.” Could you clarify the relationship between the two? Additionally, from our perspective, “OSI-approved” implies that a specific approval action must be taken to meet this requirement. Will OSI in the future add a specific approval process for terms, in addition to the current license approvals? Are there already some terms that have been approved, or can we just use certain terms from the approved licenses?
2. As I understand it, Data Information refers to descriptive information about the data, rather than the data itself. The openness requirements for code require a “full specification of how the Data Information was processed,” which would refer to the code used to process the data’s descriptive information (though this is somewhat unclear to me). However, the later examples ask for the “code used for pre-processing data” rather than “Data Information.” This makes me a little confused about the scope of the code.
Thank you very much for your attention to our feedback!

The term “OSI-approved terms” is used for the first time in the RC1 version, and I believe it implies that the requirement is not limited to licenses, unlike the previous expression “OSI-approved license.” If this is not a mere typo, it likely suggests that OSI might handle legal documents such as contracts within the review process in the future. That’s how I understand it. However, I’m still unsure how the license review process itself might change.

Additionally, if OSI plans to handle “OSI-approved terms” in the review process as mentioned above, the term “OSD-conformant terms” may no longer be necessary.