The Open Source AI Definition v.1.0-RC1 is available for comments

Originally published at: The Open Source AI Definition RC1 is available for comments – Open Source Initiative

A little over a month after v.0.0.9, we have a Release Candidate version of the Open Source AI Definition. We reached this point thanks to lots of community feedback: 5 town hall meetings, several comments on the forum and on the draft, and in-person conversations at events in Europe, China, India, Senegal, and Argentina.

There are three relevant changes to the part of the definition pertaining to the “preferred form to make modifications to a machine learning system.”

The feature that will draw the most attention is the new language of Data Information. It clarifies that all the training data needs to be shared and disclosed. The updated text comes from many conversations with individuals who engaged passionately with the design process, on the forum, in person, and on hackmd. These conversations helped identify four types of data (open, public, obtainable, and unshareable), described in more detail in the FAQ. The legal requirements differ for each type, but all must be shared in the form that the law allows.
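To make the four categories concrete, here is a minimal illustrative sketch in Python. The category names come from the definition and its FAQ; the requirement descriptions are loose paraphrases for illustration, not the FAQ's actual legal text.

```python
from enum import Enum

class DataCategory(Enum):
    """The four data categories named in RC1 (descriptions are paraphrased)."""
    OPEN = "freely licensed; share the data itself"
    PUBLIC = "publicly viewable but not freely licensed; share to the extent the law allows"
    OBTAINABLE = "available from a third party, possibly for a fee; disclose how to obtain it"
    UNSHAREABLE = "cannot legally be shared; describe it fully in Data Information"

def sharing_requirement(category: DataCategory) -> str:
    # All four categories carry an obligation; what differs is the form
    # in which the law allows the data to be shared.
    return category.value

for category in DataCategory:
    print(f"{category.name}: {sharing_requirement(category)}")
```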

Two new features are equally important. RC1 clarifies that Code must be complete: enough for downstream recipients to understand how the training was done. This reinforces the importance of training, for transparency, security, and other practical reasons. Training is where innovation is happening at the moment, which is why you don’t see corporations releasing their training and data processing code. We believe, given the current state of knowledge and practice, that this is required to meaningfully fork (study and modify) AI systems.
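As a rough illustration of what “complete” could mean in practice, here is a minimal, self-contained sketch in which the data processing step and the training loop are both released alongside the model. The dataset, architecture, and hyperparameters are all invented for the example.

```python
# Sketch: "complete" code covers data processing AND training, not just
# inference. Synthetic data stands in for the real training set.
import torch
from torch import nn

def preprocess(raw: torch.Tensor) -> torch.Tensor:
    # Data processing code is part of the preferred form: without it,
    # recipients cannot see how inputs were filtered and normalized.
    return (raw - raw.mean(dim=0)) / (raw.std(dim=0) + 1e-8)

def train(features: torch.Tensor, labels: torch.Tensor, epochs: int = 10) -> nn.Module:
    # The training loop documents how the released parameters were produced.
    model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss_fn(model(features), labels).backward()
        optimizer.step()
    return model

raw = torch.randn(256, 4)                 # placeholder dataset
labels = (raw.sum(dim=1) > 0).long()      # placeholder labels
model = train(preprocess(raw), labels)
```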

Lastly, there is new text that explicitly acknowledges that it is admissible to apply copyleft-like terms to any of the Code, Data Information, and Parameters, individually or as bundled combinations. An illustrative scenario is a consortium that owns the rights to training code and a dataset and decides to distribute the code+data bundle under legal terms that tie the two together with copyleft-like provisions. This sort of legal document doesn’t exist yet, but the scenario is plausible enough to deserve consideration. This is another area that OSI will monitor carefully as we start reviewing these legal terms with the community.

A note about science and reproducibility

The aim of Open Source is not and has never been to enable reproducible software. The same is true for Open Source AI: reproducibility of AI science is not the objective. Open Source’s role is merely not to be an impediment to reproducibility. In other words, one can always add more requirements on top of Open Source, just like the Reproducible Builds effort does.

Open Source means giving anyone the ability to meaningfully “fork” (study and modify) a system, without requiring additional permissions, to make it more useful for themselves and also for everyone. This is why OSD #2 requires that the “source code” must be provided in the preferred form for making modifications. This way everyone has the same rights and ability to improve the system as the original developers, starting a virtuous cycle of innovation. Forking in the machine learning context has the same meaning as with software: having the ability and the rights to build a system that behaves differently from the original. A fork may, for example, fix security issues, improve behavior, or remove bias. All of these are possible thanks to the requirements of the Open Source AI Definition.

What’s coming next

With the release candidate cycle starting today, the drafting process will shift focus: no new features, only bug fixes. We’ll watch for newly raised issues, looking for major flaws that may require significant rewrites to the text. The main focus will be on the accompanying documentation, the Checklist and the FAQ. We also realized that, in our zeal to solve the problem of data that needs to be provided but cannot be supplied by the model owner for good reasons, we had failed to make clear the basic requirement that “if you can share the data you must.” We have already made adjustments in RC1 and will be seeking views on how to express this better in an RC2.

In the weeks leading up to the 1.0 release on October 28, we’ll focus on:

  • Getting more endorsers to the Definition
  • Continuing to collect feedback on hackmd and the forum, focusing on new, previously unseen concerns
  • Preparing the artifacts necessary for the launch at All Things Open
  • Iterating on the Checklist and FAQ, preparing them for deployment.

Full changelog

  • Rewritten Data Information for clarity, based on feedback from multiple channels: events, town halls, the forum, and hackmd
  • Cleaned up the Definitions section, removing the decorations; also moved the section to the end of the document to give focus to the main content
  • Incorporated a comment from Yann on hackmd, adding “at least” to the benefits of Open Source AI
  • Clarified that Code must be complete, enough to understand how the training was done
  • Renamed Weights to Parameters (as it was called in previous versions), based on public feedback
  • Added a definition of machine learning, from the OECD
  • Some more cosmetic changes based on feedback on hackmd

Link to the Open Source AI Definition – 1.0-RC1

You’ve heard a lot from me on this already, so I asked gpt-o1 to use its reasoning to critique the release candidate from a practical perspective with respect to the four freedoms (which is my primary concern). It agrees that the proposed definition still fails to protect two or three of the four freedoms.

I also ran a small straw poll of AI practitioners (n=15) asking “What is the preferred form in which a practitioner would modify a model?” Fully 87% chose Data (i.e., training datasets) over Model (i.e., weights & biases). We could run a larger and more controlled survey, but this result is not even close. Let’s avoid contentious claims to the contrary when this is clearly a compromise, resulting in a compromised definition.

I do appreciate the efforts to incorporate feedback right up until the last minute, and I see you’ve switched to “no new features, only bug fixes”. Hopefully there will be an acknowledgement at launch that this is a living document, as it took until v1.9 to get the OSD mostly right. There should be a concerted effort to tighten it up and fully protect the four freedoms in the future.

Assessment of the OSAID’s Protection of the Four Freedoms from a Practical Perspective

While the OSAID aims to protect the four essential freedoms of Free Software, a practical examination reveals limitations that may hinder a researcher’s ability to study an AI system and a practitioner’s capacity to modify it effectively.

  1. Freedom to Run the Program (Freedom 0):
  • Practical Perspective: Users can run the AI system without restrictions.
  • Critique: This freedom is upheld in practice, as the OSAID explicitly allows the use of the system for any purpose.
  2. Freedom to Study the Program (Freedom 1):
  • Practical Perspective: Without access to the actual training data, researchers face significant obstacles in thoroughly understanding the AI system.
    • Data Analysis Limitations: Key aspects like biases, data quality issues, and representativeness cannot be fully assessed without the raw data.
    • Transparency Concerns: Detailed Data Information may not substitute for direct data access, leaving gaps in understanding how the model processes inputs to produce outputs.
  • Critique: The OSAID permits withholding the training data, relying instead on “Data Information.” This undermines the practical ability to study the system comprehensively, as critical insights are often derived from examining the data itself.
  3. Freedom to Modify the Program (Freedom 2):
  • Practical Perspective: Modifying an AI system typically involves retraining or fine-tuning the model, which requires access to the training data (see the sketch after this list).
    • Retraining Challenges: Without the original data, practitioners cannot retrain the model to correct issues, adapt it to new domains, or improve its performance.
    • Limited Modifications: Modifications are constrained to code-level changes or adjustments to parameters, which may not significantly alter the model’s behavior.
  • Critique: The lack of access to training data severely limits the practitioner’s ability to modify the AI system meaningfully. The OSAID’s allowance for non-disclosure of data creates a barrier to exercising this freedom fully.
  4. Freedom to Share the Program (Freedom 3):
  • Practical Perspective: Users can share the AI system, but potential legal constraints may impede this freedom.
    • Licensing Ambiguities: The OSAID mentions that parameters “shall be made available under OSI-approved terms,” but does not clarify the legal status of model parameters.
    • Intellectual Property Issues: The possibility of patents or other legal instruments can restrict sharing, even if the OSAID’s licensing requirements are met.
  • Critique: While the OSAID supports sharing in principle, practical obstacles like intellectual property rights can limit this freedom. The definition does not fully address how such legal issues might affect the ability to share the AI system.
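To illustrate the Freedom 2 point above, here is a minimal sketch (all names and data invented for the example) of what a practitioner can and cannot do when only the parameters are released: fine-tuning on their own data is possible, but retraining to undo what the unseen original data taught is not.

```python
# Sketch: with weights but no training data, a practitioner can only
# fine-tune on data they themselves hold.
import torch
from torch import nn

# Stand-in for published parameters; in practice these would be loaded
# with model.load_state_dict(torch.load("weights.pt")).
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))

# Fine-tuning only reaches behavior covered by the practitioner's data.
own_x = torch.randn(64, 4)
own_y = (own_x[:, 0] > 0).long()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for _ in range(20):
    optimizer.zero_grad()
    loss_fn(model(own_x), own_y).backward()
    optimizer.step()
# A bias learned from original examples the practitioner cannot inspect
# remains invisible here, and cannot be retrained away.
```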

Overall Critique:

The OSAID’s provisions allow for the non-disclosure of training data and do not adequately prevent other legal mechanisms (like patents or trademarks) from restricting the use, study, modification, or sharing of AI systems. As a result, the four essential freedoms are only partially protected in practice. Researchers and practitioners may find themselves unable to exercise these freedoms fully due to practical limitations imposed by the definition’s allowances and omissions.


Whew, I finally managed to translate RC1 into Japanese, my native language. Now I can calmly and thoroughly review it.

There is a minor revision in the Preamble. In the sentence “For AI, society needs at least the same essential freedoms of Open Source,” the phrase “at least” has been added. It seems to suggest the possibility that AI may require freedoms beyond the four essential ones.
I’m not trying to express approval or disapproval, but I find it interesting that this change was made.

Overall, I feel it’s a positive shift. I’ll take a more thorough look at it after I wake up tomorrow morning.

PS: The ‘at least’ was mentioned in the changelog.


At least the scope is limited rather than completely free; I agree with that part.
Still, like a coin with two sides, those who come to reuse an original invention may modify it with either good or bad intentions. I think it would be good if we could make the rules around the reuse of data clearer in order to protect security, and it would be great to add some requirement to honor the creator of the original work. I may have digressed a bit, but among the purposes many people discuss, some of us, like me, should also consider security. In the future, destroyers and defenders may end up in the same place, through small points that were overlooked.
Generally, I agree, although the definition does not contain all the elements in this category:
  • Fairness to all who create or use it
  • Trustworthiness and safety
  • Respect for privacy and security
  • Inclusiveness, and the promotion of inclusion and attraction
  • Transparency in creation and use
  • Accountability for consequences if problems arise in the future