Draft v.0.0.9 of the Open Source AI Definition is available for comments

Originally published at: Community input drives the new draft of the Open Source AI Definition – Open Source Initiative

The Open Source AI Definition v0.0.9 has been released and collaboration continues at in-person events and in the online forums. Read what changes have been made, what to do next and how to get involved.



A new version of the Open Source AI Definition has been released with one new feature and a cleaner text, based on comments received from public discussions and recommendations. We’re continuing our march towards having a stable release by the end of October 2024, at All Things Open. Get involved by joining the discussion on the forum, finding OSI staff around the world and online at the weekly town halls.

New feature: clarified Open Source model and Open Source weights

  • Under “What is Open Source AI,” there is a new paragraph that (1) identifies both models and weights/parameters as encompassed by the word “system” and (2) makes it clear that all components of a larger system have to meet the standard. There is a new sentence in the paragraph after the “share” bullet making this point.
  • Under the heading “Open Source models and Open Source weights,” there is a description of the components for both of those for machine learning systems. We also edited the paragraph below those additions to eliminate some redundancy.

Training data in the preferred form to make modifications

The role of training data is one of the most hotly debated parts of the definition. After long deliberation and co-design sessions we have concluded that defining training data as a benefit, not a requirement, is the best way to go.

Training data is valuable for studying AI systems: it helps us understand the biases that have been learned, which can impact system behavior. But training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.

Data can be hard to share. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information, such as decisions about their health. Similarly, much of the world’s Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.

  • Open training data (data that can be reshared) provides the best way to enable users to study the system, along with the preferred form of making modifications.
  • Public training data (data that others can inspect as long as it remains available) also enables users to study the work, along with the preferred form.
  • Unshareable non-public training data (data that cannot be shared for explainable reasons) gives the ability to study some of the system’s biases and demands a detailed description of the data – what it is, how it was collected, its characteristics, and so on – so that users can understand the biases and categorization underlying the system.

OSI believes these extra requirements for data, beyond the preferred form of making modifications to the AI system, both advance openness in all the components of that preferred form and drive more Open Source AI in privacy-first areas such as healthcare.

Other changes

  • The Checklist is separated into its own document. This is to separate the discussion about how to identify Open Source AI from the establishment of general principles in the Definition. The content of the Checklist has also been fully aligned with the Model Openness Framework (MOF), allowing for an easy overlay.
  • Under “Preferred form to make modifications,” the word “Model” changed to “Weights.” The word “Model” was referring only to parameters, and was inconsistent with how the word “model” is used in the rest of the document.
  • There is an explicit reference to the intended recipients of the four freedoms: developers, deployers and end users of AI systems.
  • Incorporated credit to the Free Software Definition.
  • Added references to conditions of availability of components, referencing the Open Source Definition.

Next steps

  • Continue iterating through drafts after meeting diverse stakeholders at the worldwide roadshow, collecting feedback and carefully looking for new arguments in dissenting opinions.
  • Decide how to best address the reviews of new licenses for datasets, documentation and the agreements governing model parameters.
  • Keep improving the FAQ.
  • Prepare for post-stable-release: Establish a process to review future versions of the Open Source AI Definition.

Collecting input and endorsements

We will be taking draft v.0.0.9 on the road to collect input and endorsements, thanks to a grant from the Sloan Foundation. The lively conversation about the role of data in building and modifying AI systems will continue at multiple conferences around the world, at the weekly town halls and online throughout the Open Source community.

The first two stops are in Asia: Hong Kong for AI_dev August 21-23, then Beijing for Open Source Congress August 25-27. Other events are planned to take place in Africa, South America, Europe and North America. These are all steps toward the conclusion of the co-design process that will result in the release of the stable version of the Definition in October at All Things Open.

Creating an Open Source AI Definition has been an arduous task over the past two years, but we know the importance of creating this standard so the freedoms to use, study, share and modify AI systems can be guaranteed. Those are the core tenets of Open Source, and they warrant the dedicated work this effort has required. You can read about the people who have played key roles in bringing the Definition to life in our Voices of the Open Source AI Definition series on the blog.

How to get involved

The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:

  • Join the forum: share your comment on the drafts.
  • Leave comments on draft v.0.0.9: provide precise feedback on the text of the latest draft.
  • Follow the weekly recaps: subscribe to our monthly newsletter and blog to be kept up-to-date.
  • Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
  • Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.

The OSI X account asked us to post this here; we had emailed about this previously (see the email sent about a year ago):

Subject: seeking input - attribution of openai chatgpt in open-source code published by adafruit
Date: September 9, 2023 at 1:23:06 PM EDT

We’re proposing that part of Open Source + AI would be transparency about the prompts used. We created and demonstrated specific examples and cite these in our GitHub repositories, including: the text used, the prompt, the logs, and direct links to “conversations” with the LLM and developer.

In the draft there is a “code” section:

Code: The source code used to train and run the system, made available with OSI-approved licenses.

We think adding prompts would be helpful to consider, along with citing the model and model version used and, if possible, the log. With OpenAI we’ve been linking directly to the prompts and responses.

Check out our video and our article “Writing an Arduino driver with OpenAI ChatGPT and PDF parsing”; here’s an example of how we credit the AI “partnership” when publishing open-source code.

the forums would not allow the video, so here it is as well…

and here is the image, the forums will not allow us to post the video and images in one post (limit 1 embedded media per post it seems).


I believe that posting limits are common for new accounts.

Hi!

Welcome! Note that this is a consensus process so the only viable input will be in open spaces like this one; OSI itself is facilitating and not dictating. Apologies if this was not clear, but you are now in the right place!

Surely that is an implementation-specific detail of certain systems in certain configurations? I would be interested in your proposals for edits to the current draft that implement this in terms applicable to any system and not just prompt-driven LLMs which, though important, are a subset.

Cheers

Simon

hi simon, wow - it’s great to see you here, as you have context with open-source hardware (pt here). if i recall correctly, you had issues with the open-source hardware logo that was based on my design which was used for OSI, and we resolved it together (thank you for that) - https://www.oshwa.org/wp-content/uploads/2012/08/233124698_1.pdf

here is what we are proposing for an AI addition to the OSI def, as well as the open hardware def…

Inspection of Prompts and Data Access Transparency:

In addition to the existing requirements, the preferred form for making modifications to a machine-learning system shall include access to the prompts and commands used during the training phase and/or code and hardware creation. This will enable users to understand the context in which the model was developed, including:

  • Prompt Transparency: Access to a detailed log of all prompts, commands, and instructions used during the training phase and/or code and hardware creation, ensuring that users can see the exact inputs that shaped the model’s behavior.
  • Justification and Documentation: Each prompt should be accompanied by documentation explaining its purpose, how it was constructed, and its expected impact on the model’s development.
  • Replicability and Testing: The framework should provide means for users to replicate prompt scenarios to test modifications and understand their effects on the model’s outputs.
  • Prompt and Model Linking: Direct links to the specific model versions used along with the corresponding prompts, enabling a traceable lineage from input to model behavior.
  • Timestamp and Metadata Documentation: Each entry of the prompt log should be timestamped and include metadata such as the version of the model used at that time.
  • Public Access to Logs: Where possible, logs of the prompts should be made publicly available, with links provided in the documentation to ensure that users can review the historical context and development trajectory of the model.

This addition aims to enhance transparency and foster an environment where users can more effectively audit, replicate, and modify AI behavior.
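
For illustration only, here is a minimal sketch of what one machine-readable entry in such a prompt log could look like (Python; every field name here is a hypothetical choice, not part of the proposal or of any draft text):

```python
import json
from datetime import datetime, timezone

# One hypothetical prompt-log entry carrying the fields proposed above:
# the prompt itself, its justification, a timestamp, model metadata and a
# public link to the conversation. Field names are illustrative only.
entry = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "model": "gpt-4",                       # model used (assumed example)
    "model_version": "2023-09-01",          # model snapshot (assumed example)
    "prompt": "Write an Arduino driver for this sensor from the datasheet.",
    "purpose": "Generate the initial driver skeleton.",
    "conversation_url": "https://example.com/share/abc123",  # placeholder
}

# Append the entry to a JSON Lines log committed alongside the published code.
with open("prompt_log.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry) + "\n")
```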

Wow, twelve years! It does seem so long ago!

Right, my observation is this seems pretty specific to a single approach, no? I admit I am only tagging along here as dealing with public policy is all-consuming…

This definition does not grant the freedom to modify and is unacceptable as an Open Source Definition.

With AI models, the weights are the user interface. I can use them directly as a user. They are what is typically distributed to everyone.

The actual source of the model comes from the data AND the code. The weights are built using the code and the data. Together they make up the ability to reproduce and modify the original.

The weights are the program, and they cannot be built/compiled without access to both the code AND the data.

As an analogy, imagine if every running/compiled version of PostgreSQL disappeared from existence. The moment after that I could recreate that binary and run it because I have access to the source code.
Now imagine the same happened to the model weights for a model that does not share its data. There is no way I could ever reproduce those same weights from just the code they ran.

Yes, I can modify the weights after the fact, but that is the same as adding a separate package rather than truly modifying the source. There is no possible way for me to run the code on a different data set and get the same weights. Modifying the weights after the fact with new data (aka fine-tuning) would not produce the same weights as adding that data to the original data, as the sketch below illustrates.
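
A toy sketch of that last point, in PyTorch with made-up data and identical initializations (purely illustrative, not anyone’s actual training setup): training once on the combined data ends up with different weights than training on the original data and then fine-tuning on the new data.

```python
import torch

torch.manual_seed(0)
d1, t1 = torch.randn(16, 4), torch.randn(16, 1)  # "original" data
d2, t2 = torch.randn(16, 4), torch.randn(16, 1)  # "new" data

def train(model, xs, ys, steps=50):
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        torch.nn.functional.mse_loss(model(xs), ys).backward()
        opt.step()
    return model

# Path A: train on the original data, then fine-tune on the new data.
torch.manual_seed(1)                      # identical initial weights
a = train(train(torch.nn.Linear(4, 1), d1, t1), d2, t2)

# Path B: train once on the combined data.
torch.manual_seed(1)                      # identical initial weights
b = train(torch.nn.Linear(4, 1), torch.cat([d1, d2]), torch.cat([t1, t2]))

# The weights differ: fine-tuning is not equivalent to retraining.
print((a.weight - b.weight).abs().max().item())
```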

I understand some models are built on proprietary or sensitive data. In this case, all that can be shared is the weights. That is still amazing and great, but call it something other than open source – how about “open weights”?

If someone said there was a proprietary piece of code required for their FOSS software, you would say then the project is not Open Source. It can still be a great and valuable contribution but it is not Open Source. This is the same logic you used for the commercial shared source licenses.

The ones muddying the water are the people loosely using the term open source; they are the ones creating the problem. As one example of how the term is being abused, look at Llama 3 (which is a great model I love, and I am glad they share the weights): on its Hugging Face page you cannot even download the weights without agreeing to give your email, and Item 2 of their license agreement places restrictions on the terms of use:

  2. Additional Commercial Terms. If, on the Llama 3.1 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.

The term is being abused by those who want to get the benefit of the label but not follow the definition.


Totally agree with @thesteve0.

Systems based on machine learning techniques are composed of two kinds of software: a virtual machine (with a specific architecture) that basically maps vectors to vectors, and a set of “weight” matrices that constitute the software executed by that virtual machine (the “AI model”).

The source code of the virtual machine can be open source, so that given the proper compiler, we can create an exact copy of such software.

In the same way, the software executed by the virtual machine (usually referred to as “the AI model”) is encoded in a binary form that the specific machine can directly execute (the weight matrices). The source code of that binary is composed of all the data required to recreate an exact copy of it: the full dataset used, but also any random seed or input used during the process, for example the initial random values used to initialize an artificial neural network.
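
To make the analogy concrete, here is a minimal sketch (NumPy, purely illustrative) of the two parts: the “virtual machine” is ordinary code that maps vectors to vectors and can be published as open source, while the weight matrices are the artifact it executes:

```python
import numpy as np

# The "virtual machine": generic architecture code that maps vectors to
# vectors. This part is ordinary source code.
def forward(x, weights):
    for W, b in weights:
        x = np.maximum(W @ x + b, 0.0)  # affine map + ReLU, layer by layer
    return x

# The "binary" executed by that machine: the weight matrices. In a real
# system these are produced by training (code + data + seeds); here they
# are just random stand-ins, with the seed recorded as part of the "source".
rng = np.random.default_rng(seed=42)
weights = [
    (rng.standard_normal((8, 4)), np.zeros(8)),
    (rng.standard_normal((2, 8)), np.zeros(2)),
]

print(forward(np.ones(4), weights))
```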

Even if the weights are handy to modify an AI system, they are in no way enough to study it.

So, any system that does not provide the whole dataset required to recreate an exact copy of the model cannot be defined as open source.

Note that in an age of supply-chain attacks that leverage open source, the right to study the system also has huge practical security value: arXiv:2204.06974 showed that undetectable backdoors can be planted in machine learning models.

Thus I suggest modifying the definition so that it reads:

Data information: Sufficiently detailed information about all the data used to train the system (including any random value used during the process), so that a skilled person can recreate an exact copy of the system using the same data. Data information shall be made available with licenses that comply with the Open Source Definition.

Being able to build a “substantially equivalent” system means not being able to build that system, but a different one. It would be like defining Google Chrome as “open source” just because we have access to Chromium source code.

When its training data cannot legally be shared, an AI system cannot be defined as “open source” even if all the other components comply with the open source definition, because you cannot study that system, only the components available under the open source license.

Such a system can be valuable, but not open source, even if the weights are available under an OSD-compliant license, because they encode an opaque binary for a specific architecture, not source code.

Let’s properly call such models and systems “freeware” and build a definition of Open Source AI that is coherent with the Open Source one.

Shamar,

Is “exact copy” actually possible? That is, imagine you had the complete training data for a particular OSAI system. Would you be able to create an exact copy of it? Or would you end up with one that’s very similar, but different in small and subtle ways?

Under Japan’s copyright law, the reproduction of copyrighted works without permission for AI training purposes is generally considered permissible. Therefore, copyrighted works that are publicly available on the internet can be used for AI training without obtaining permission. However, the datasets created for training cannot be widely distributed to the public unless the copyright holders of those works have explicitly granted permission for such use. This follows the basic principles of copyright.

From the perspective of AI developers, this means that even if they have conducted AI training within the legal limits set by copyright law, they cannot freely distribute the datasets created for that training. AI vendors have no reason to obtain permissions from copyright holders for uses of copyrighted material beyond training purposes, and copyright holders have no incentive to consider any concessions specifically for AI training. This suggests that there is unlikely to be significant demand for completely free and open datasets.

Not only in Japan, but eventually in many jurisdictions, the reproduction of copyrighted works without permission for AI training purposes will likely be recognized. At the same time, making all the data used for such training completely free and open is unlikely to be permitted. However, if all the necessary data and the complete information required to create the dataset are made publicly available, then in principle, equivalent AI training could be conducted.

The current OSAID reflects this understanding and seems to strike a reasonable balance within the framework of global intellectual property rights. I think adding more specific language could be helpful, but so far, I haven’t come up with the right wording.

I’ll start my comment with two general observations:

  1. The Open Source Definition came following 15 years of experience with Free Software. Now, there is a lot less practical experience to inform the OSAID work.
  2. The main tension seems to me to be between the binary requirement and the pragmatic requirement. By the binary requirement I refer to @stefano’s statement:

Open Source is a binary concept: you’re either granting end users and developers the right to have self-sovereignty over the technology or you’re not.

By the pragmatic requirement, I refer to the requirement that some current systems must fit the definition. I very much agree with @stefano’s sentiment in the binary requirement: either essential freedoms are granted or they are not. However, it is also entirely possible that no current system does that, that we simply haven’t yet figured out how to secure those freedoms in an AI world.

I also understand the urgency. Taken together, this suggests to me that the OSAID must expect iterations and refinements (like most things in the industry). If the OSI is prepared to take on this long-term task, then I suggest a slight shift of focus to what must be done now and what can change in the (near) future.

I find that there are many weighty arguments against requiring open training data. Especially the (sad) situation of copyright and the colonization argument. I don’t know how much we can expect for copyright to change, but I have seen a substantial change in sentiment from 20-25 years back. As for the colonization argument, I think it is worth much more attention than it has had, and that we should proceed with care, but I also think that it can be addressed with OSS to enable disadvantaged people to find value in their data. I also think it is important to consider the federated training scenario. I could imagine (though I don’t know if it is practical) that heart data could be gathered from smart watches to give early warnings of heart disease, or say an impending heart attack. If this can’t be open source, then it would certainly be a proprietary Apple model…

Having been an open data advocate for 25 years, I find it very difficult to accept the current definition, even though I acknowledge these arguments. There are two main points I would like to bring up:

  1. Given that the definition can be updated, it seems to me better to have a requirement that can be relaxed as more information becomes available. I.e., it is easier to require open data now and relax that constraint later than to say “you were open source by the old definition, but not by the new one”.
  2. The WGs studied four systems and found that open data was not necessary for the four freedoms. I acknowledge that. But how about the reverse: can we come up with concrete situations where the four freedoms would be restricted if training data was not available? And by that, I don’t mean hypothetical examples, nor analogies, but real-world examples. I admit to not having been sufficiently deep in this area to do so myself, but for those arguing against the current definition, this is the challenge I put to you, as I believe it is needed for the argument to be compelling.

Finally, it seems that the current definition needs some wordsmithing, especially around the “skilled person” and “equivalent system”. My experience is that such things are hard to get done online, but I would strongly encourage the OSI to bring it up in an upcoming F2F workshop.

I understand that the situation is complex, and overall I think you have done a good job. I am not quite comfortable with the lack of an open data requirement, and would hope to see that further addressed. Also, I note (again) that governance is not being addressed; it needs to be in order to alleviate the openwashing problem, and such work should be encouraged even if it is out of scope for the current effort.

Sure, it’s perfectly possible.

If you have all the data used during the training (including random seeds, initial values, learning rates, the elements used in each batch and so on), all the information relevant to repeating the process, and functionally equivalent hardware, you can replicate the process and obtain the exact same system.

You can try it for yourself, but it’s quite obvious: we are not talking about quantum computing, so each step of the process is strictly deterministic and reproducible if you don’t throw away relevant data (such as the values provided by random sources during the process).
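
As a minimal sketch of what keeping the relevant data looks like in practice, assuming PyTorch on a single fixed machine (bit-for-bit reproducibility across different hardware or library versions is a separate question):

```python
import random
import numpy as np
import torch

# Pin every source of randomness feeding the training process. These seeds
# are exactly the kind of "relevant data" that must be recorded and shared.
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# Refuse nondeterministic kernels and disable cuDNN autotuning so the same
# computational path is taken on every run.
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False

# With fixed seeds, a fixed data order and deterministic kernels, two runs
# of this loop in the same environment produce identical weights.
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
data = torch.randn(32, 4)      # stands in for the (recorded) dataset
target = torch.randn(32, 2)

for _ in range(10):
    opt.zero_grad()
    torch.nn.functional.mse_loss(model(data), target).backward()
    opt.step()

print([p.sum().item() for p in model.parameters()])
```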

At least in the case of deep learning models, I believe the general understanding is that while it is technically possible to replicate a “functionally equivalent” model, achieving a bit-for-bit exact reproduction is extremely difficult. This is because minor differences at the computational level, arising from slight variations in time or environment, can compound significantly during the training process, leading to slight changes in the final model. As far as I understand, the long discussions we have had up until now have been based on this premise.

I am not an expert in ML/LLM, so I am not sure if this understanding is correct, but even if a completely accurate copy were possible, it might not be of much significance if it is not a method accessible to everyone.

That premise is wrong.

Given all the data and information, bit-for-bit exact reproduction is possible even for deep learning models.
It’s even easy if the computing environment is properly set up.

It doesn’t matter: even compiling Debian from scratch is not a method accessible to everyone, yet we distinguish between software that is open source and software that isn’t.

Even the fact that some legal systems might prevent distributing the training dataset is irrelevant: it will always be possible to create models from data that can be shared, so it’s not an issue for the applicability of the definition.

Furthermore, hardware that is expensive today will be cheaper in a few years, both because of usual market dynamics and because of antitrust law applied to trusts like the NVidia / Microsoft / OpenAI one.

Finally, the Open Source AI Definition shouldn’t be narrowly optimized for the currently hyped techniques (GPT, anyone? :smiley:).
This is a field of frequent disruption, and we should not pretend that what is difficult to achieve today will be difficult tomorrow. In fact, still today, several AI techniques are trivial to replicate exactly given all the relevant data and information.

If we weaken the Open Source AI Definition so that it allows proprietary software that nobody can really inspect to pose as “open”, we damage every truly open source AI that could and would provide the training data, and we disincentivize such basic transparency even when it is legally possible.

If we weaken the Open Source AI Definition with such a huge loophole, it will be exploited by bad actors, extending to them the legal exceptions designed to benefit projects that really contribute to the commons by providing all the freedoms of open source with full transparency.

We won’t have any way to verify that the declared datasets were actually used during the creation of the models, that they were used as described, and so on. Users will run open-washed proprietary software they cannot really inspect, study or modify, instead of truly open alternatives.

The fact that recreating certain deep learning models is complex and expensive doesn’t mean that it’s impossible (in fact, it’s always possible if no relevant data are kept hidden), and the Open Source AI Definition shouldn’t allow systems that prevent the exercise of that right/freedom to pose as “open”.

How about using the phrase “Sufficiently detailed information about the data used to train the system, to the extent possible”? The term “extent possible” is frequently used in the clauses of Creative Commons licenses, which shows that it is a well-established expression. Additionally, I believe it could help close the “huge loophole” you mentioned to some extent.

Using “to the extent possible” would lead bad actors to discard relevant training data (even just the values provided by random sources) so that they could argue in court that providing such data is not possible.

The loophole is still there.

On the other hand, a wording along the lines of:

Sufficiently detailed information and all the data used to train the system (including any random value used during the process), so that a skilled person can recreate an exact copy of the system using the same data.

would simply mean that if you cannot, or do not want to, provide any of the required data sources, your AI system is not Open Source.
It could still be useful, innovative, with weights completely dedicated to the public domain… but not Open Source.

I mean, it’s in the name of the definition: if the full source is not available, how could it be classified as Open Source?

Heck, it’s not even “source-available”!