Draft v.0.0.9 of the Open Source AI Definition is available for comments

Originally published at: Community input drives the new draft of the Open Source AI Definition – Open Source Initiative

The Open Source AI Definition v0.0.9 has been released and collaboration continues at in-person events and in the online forums. Read what changes have been made, what to do next and how to get involved.



A new version of the Open Source AI Definition has been released with one new feature and a cleaner text, based on comments received from public discussions and recommendations. We’re continuing our march towards having a stable release by the end of October 2024, at All Things Open. Get involved by joining the discussion on the forum, finding OSI staff around the world and online at the weekly town halls.

New feature: clarified Open Source model and Open Source weights

  • Under “What is Open Source AI,” there is a new paragraph that (1) identifies both models and weights/parameters as encompassed by the word “system” and (2) makes it clear that all components of a larger system have to meet the standard. There is a new sentence in the paragraph after the “share” bullet making this point.
  • Under the heading “Open Source models and Open Source weights,” there is a description of the components for both of those for machine learning systems. We also edited the paragraph below those additions to eliminate some redundancy.

Training data in the preferred form to make modifications

The role of training data is one of the most hotly debated parts of the definition. After long deliberation and co-design sessions we have concluded that defining training data as a benefit, not a requirement, is the best way to go.

Training data is valuable to study AI systems: to understand the biases that have been learned, which can impact system behavior. But training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.

Data can be hard to share. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information, such as decisions about their health. Similarly, much of the world’s Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.

  • Open training data (data that can be reshared) provides the best way to enable users to study the system, along with the preferred form of making modifications.
  • Public training data (data that others can inspect as long as it remains available) also enables users to study the work, along with the preferred form.
  • Unshareable non-public training data (data that cannot be shared for explainable reasons) gives the ability to study some of the system’s biases and demands a detailed description of the data – what it is, how it was collected, its characteristics, and so on – so that users can understand the biases and categorization underlying the system.

OSI believes these extra requirements for data, which go beyond the preferred form of making modifications to the AI system, both advance openness in all the components of the preferred form of modifying the AI system and drive more Open Source AI in private-first areas such as healthcare.

Other changes

  • The Checklist is separated into its own document. This is to separate the discussion about how to identify Open Source AI from the establishment of general principles in the Definition. The content of the Checklist has also been fully aligned with the Model Openness Framework (MOF), allowing for an easy overlay.
  • Under “Preferred form to make modifications,” the word “Model” changed to “Weights.” The word “Model” was referring only to parameters, and was inconsistent with how the word “model” is used in the rest of the document.
  • There is an explicit reference to the intended recipients of the four freedoms: developers, deployers and end users of AI systems.
  • Incorporated credit to the Free Software Definition.
  • Added references to conditions of availability of components, referencing the Open Source Definition.

Next steps

  • Continue iterating through drafts after meeting diverse stakeholders at the worldwide roadshow, collect feedback and carefully look for new arguments in dissenting opinions.
  • Decide how to best address the reviews of new licenses for datasets, documentation and the agreements governing model parameters.
  • Keep improving the FAQ.
  • Prepare for post-stable-release: Establish a process to review future versions of the Open Source AI Definition.

Collecting input and endorsements

We will be taking draft v.0.0.9 on the road collecting input and endorsements, thanks to a grant by the Sloan Foundation. The lively conversation about the role of data in building and modifying AI systems will continue at multiple conferences from around the world, the weekly town halls and online throughout the Open Source community.

The first two stops are in Asia: Hong Kong for AI_dev August 21-23, then Beijing for Open Source Congress August 25-27. Other events are planned to take place in Africa, South America, Europe and North America. These are all steps toward the conclusion of the co-design process that will result in the release of the stable version of the Definition in October at All Things Open.

Creating an Open Source AI Definition has been an arduous task over the past two years, but we know the importance of creating this standard so the freedoms to use, study, share and modify AI systems can be guaranteed. Those are the core tenets of Open Source, and they warrant the dedicated work this has required. You can read about the people who have played key roles in bringing the Definition to life in our Voices of Open Source AI Definition on the blog.

How to get involved

The OSAID co-design process is open to everyone interested in collaborating. There are many ways to get involved:

  • Join the forum: share your comment on the drafts.
  • Leave comments on draft v.0.0.9: provide precise feedback on the text of the latest draft.
  • Follow the weekly recaps: subscribe to our monthly newsletter and blog to be kept up-to-date.
  • Join the town hall meetings: we’re increasing the frequency to weekly meetings where you can learn more, ask questions and share your thoughts.
  • Join the workshops and scheduled conferences: meet the OSI and other participants at in-person events around the world.

The OSI X account asked us to post this here. We had emailed about this previously… (see the email below, sent about a year ago) …

Subject: seeking input - attribution of openai chatgpt in open-source code published by adafruit
Date: September 9, 2023 at 1:23:06 PM EDT

We’re proposing that part of Open-Source + AI would be to be transparent about the prompts used. We created and demonstrated specific examples and cite these on our GitHub repositories, including the text used, the prompt, the logs, and direct links to the “conversations” between the LLM and the developer.

In the draft there is a “code” section:

Code: The source code used to train and run the system, made available with OSI-approved licenses.

We think adding prompts would be helpful to consider, along with citing the model / model version used and if possible the log, with OpenAI we’ve been linking directly to the prompts and responses.

Check out our video, and article “Writing an Arduino driver with OpenAI ChatGPT and PDF parsing” and here’s an example of how we are crediting the AI “partnership” when publishing open-source code.

the forums would not allow the video, so here it is as well…

and here is the image, the forums will not allow us to post the video and images in one post (limit 1 embedded media per post it seems).

[image: driver]

I believe that posting limits are common for new accounts.

Hi!

Welcome! Note that this is a consensus process so the only viable input will be in open spaces like this one; OSI itself is facilitating and not dictating. Apologies if this was not clear, but you are now in the right place!

Surely that is an implementation-specific detail of certain systems in certain configurations? I would be interested in your proposals for edits to the current draft that implement this in terms applicable to any system and not just prompt-driven LLMs which, though important, are a subset.

Cheers

Simon

hi simon, wow - it’s great to see you here, as you have context with open-source hardware (pt here) if i recall correctly you had issues with the open-source hardware logo that was based on my design which was used for OSI and we resolved it together (thank you for that) - https://www.oshwa.org/wp-content/uploads/2012/08/233124698_1.pdf

here is what we are proposing for an AI addition to the OSI def, as well as the open hardware def…

Inspection of Prompts and Data Access Transparency:

In addition to the existing requirements, the preferred form for making modifications to a machine-learning system shall include access to the prompts and commands used during the training phase and/or code and hardware creation. This will enable users to understand the context in which the model was developed, including:

  • Prompt Transparency: Access to a detailed log of all prompts, commands, and instructions used during the training phase and/or code and hardware creation, ensuring that users can see the exact inputs that shaped the model’s behavior.
  • Justification and Documentation: Each prompt should be accompanied by documentation explaining its purpose, how it was constructed, and its expected impact on the model’s development.
  • Replicability and Testing: The framework should provide means for users to replicate prompt scenarios to test modifications and understand their effects on the model’s outputs.
  • Prompt and Model Linking: Direct links to the specific model versions used along with the corresponding prompts, enabling a traceable lineage from input to model behavior.
  • Timestamp and Metadata Documentation: Each entry of the prompt log should be timestamped and include metadata such as the version of the model used at that time.
  • Public Access to Logs: Where possible, logs of the prompts should be made publicly available, with links provided in the documentation to ensure that users can review the historical context and development trajectory of the model.

This addition aims to enhance transparency and foster an environment where users can more effectively audit, replicate, and modify AI behavior.
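As a rough illustration of what one entry in such a prompt log might hold, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `PromptLogEntry` class, its field names, the JSON Lines layout, the model name, and the placeholder URL are all hypothetical and are not part of the proposal or of any draft of the Definition.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json


@dataclass
class PromptLogEntry:
    """One entry in a hypothetical prompt log (all field names are illustrative)."""
    prompt: str            # exact text sent to the model
    purpose: str           # why the prompt was written and its expected effect
    model: str             # model name as reported by the provider
    model_version: str     # version identifier in use at the time
    conversation_url: str  # public link to the full exchange, if one exists
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_json_lines(entries):
    """Serialize entries as JSON Lines, one record per line.

    Sorting by the timestamp string assumes every entry uses the same
    ISO 8601 format and time zone.
    """
    ordered = sorted(entries, key=lambda e: e.timestamp)
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in ordered)


# Placeholder values only; none of these come from a real log.
entry = PromptLogEntry(
    prompt="Write an Arduino driver for the sensor described in this datasheet.",
    purpose="Generate a first draft of the driver for human review.",
    model="example-llm",
    model_version="2023-09-01",
    conversation_url="https://example.com/conversations/placeholder",
    timestamp="2023-09-09T17:23:06+00:00",
)
log_text = to_json_lines([entry])
```

A plain-text, one-record-per-line file like this could be committed next to the published code, so the log and the code share the same version history.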

Wow, twelve years! It does seem so long ago!

Right, my observation is this seems pretty specific to a single approach, no? I admit I am only tagging along here as dealing with public policy is all-consuming…

This definition does not grant the freedom to modify and is unacceptable as an Open Source Definition.

With AI models, the weights are the user interface. I can use them directly as a user. They are what is typically distributed to everyone.

The actual source of the model comes from the data AND the code. The weights are built using the code and the data. Together they make up the ability to reproduce and modify the original.

The weights are the program and they can not be built/compiled without access to both the code AND the data.

As an analogy, imagine if every running/compiled version of PostgreSQL disappeared from existence. The moment after that I could recreate that binary and run it because I have access to the source code.
Now imagine the same happened to the model weights for a model that does not share its data. There is no way I could ever reproduce those same weights from just the code they ran.

Yes I can modify the weights after the fact but that is the same as adding a separate package rather than truly modifying the source. There is no possible way for me to run the code on a different data set and get the same weights. Modifying the weights after the fact with new data (aka fine tuning) would not produce the same weights as adding that data to the original data.
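The point about fine-tuning can be seen on a toy model. The sketch below is an illustration under stated assumptions, not a claim about any real system: it uses a hypothetical two-parameter linear model trained with plain SGD, and the data points, learning rate, and epoch count are made up. It compares fine-tuning the released weights on new data with retraining from scratch on the combined data.

```python
def sgd(weights, data, lr=0.1, epochs=2):
    """Plain SGD on a 1-D linear model y = w*x + b, visiting data in order."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y  # prediction error on this sample
            w -= lr * err * x      # gradient step for the slope
            b -= lr * err          # gradient step for the intercept
    return (w, b)


# Made-up data: the "original" training set and one later addition.
original = [(1.0, 2.0), (2.0, 4.0)]
extra = [(3.0, 3.0)]

# Weights as released by the original trainer.
released = sgd((0.0, 0.0), original)

# Path 1: fine-tune the released weights on the new data only.
fine_tuned = sgd(released, extra)

# Path 2: retrain from the same starting point on original + new data.
retrained = sgd((0.0, 0.0), original + extra)

print("fine-tuned:", fine_tuned)
print("retrained: ", retrained)
```

In this toy run the two paths end at different weights, and the second path is only available to someone who holds the original data.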

I understand some models are built on proprietary or sensitive data. In this case, all that can be shared is the weights. That is still amazing and great but call it something other than open source - how about open weights.

If someone said there was a proprietary piece of code required for their FOSS software, you would say then the project is not Open Source. It can still be a great and valuable contribution but it is not Open Source. This is the same logic you used for the commercial shared source licenses.

The ones muddying the water are the people loosely using the term open source and creating the problem. As one example of how the term open source is being abused, look at llama3 (which is a great model I love, and I am glad they share the weights). On their Hugging Face page you cannot download even the weights without agreeing to give your email. Item 2 of their license agreement restricts use:

  2. Additional Commercial Terms. If, on the Llama 3.1 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.

The term is being abused by those who want to get the benefit of the label but not follow the definition.


Totally agree with @thesteve0.

Systems based on machine learning techniques are composed of two kinds of software: a virtual machine (with a specific architecture) that basically maps vectors to vectors, and a set of “weights” matrices that constitutes the software executed by such a virtual machine (the “AI model”).

The source code of the virtual machine can be open source, so that given the proper compiler, we can create an exact copy of such software.

In the same way, the software executed by the virtual machine (usually referred to as “the AI model”) is encoded in a binary form that the specific machine can directly execute (the weight matrices). The source code of such binary is composed of all the data required to recreate an exact copy of the binary (the weights). Such data include the full dataset used but also any random seed or input used during the process, such as, for example, the initial random value used to initialize an artificial neural network.
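The role of the random seed can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the `train` function, the data, and the hyperparameters are hypothetical, and the model is a two-parameter linear fit running in a single thread on a CPU. Large-scale training adds other sources of variation (for example, the order of parallel floating-point operations on accelerators) that this sketch does not model.

```python
import random


def train(data, seed, lr=0.05, epochs=5):
    """Train y = w*x + b with SGD; the seed fixes both the initial
    weights and the order in which samples are visited."""
    rng = random.Random(seed)   # private generator, independent of global state
    w = rng.uniform(-1.0, 1.0)  # random initial slope
    b = rng.uniform(-1.0, 1.0)  # random initial intercept
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)    # random visiting order each epoch
        for x, y in samples:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return (w, b)


# Made-up, slightly noisy data.
data = [(0.0, 1.0), (1.0, 2.5), (2.0, 5.5), (3.0, 6.0)]

first = train(data, seed=42)
second = train(data, seed=42)  # same data, same seed
other = train(data, seed=7)    # same data, different seed

print(first == second)  # same inputs give the same weights
print(first == other)   # a different seed gives different weights
```

In this setting, the data alone does not determine the weights; the data together with the seed does.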

Even if the weights are handy to modify an AI system, they are in no way enough to study it.

So any system that does not provide the whole dataset required to recreate an exact copy of the model cannot be defined as open source.

Note that in an age of supply chain attacks that leverage open source, the right to study the system also has a huge practical security value, as arXiv:2204.06974 showed that you can plant undetectable backdoors in machine learning models.

Thus I suggest modifying the definition so that it reads:

Data information: Sufficiently detailed information about all the data used to train the system (including any random value used during the process), so that a skilled person can recreate an exact copy of the system using the same data. Data information shall be made available with licenses that comply with the Open Source Definition.

Being able to build a “substantially equivalent” system means not being able to build that system, but a different one. It would be like defining Google Chrome as “open source” just because we have access to Chromium source code.

When its training data cannot legally be shared, an AI system cannot be defined as “open source” even if all the other components comply with the open source definition, because you cannot study that system, but only the components available under the os license.

Such a system can be valuable, but not open source, even if the weights are available under an OSD-compliant license, because they encode an opaque binary for a specific architecture, not source code.

Let’s properly call such models and systems “freeware” and build a definition of Open Source AI that is coherent with the Open Source Definition.

Shamar,

Is “exact copy” actually possible? That is, imagine you had the complete training data for a particular OSAI system. Would you be able to create an exact copy of it? Or would you end up with one that’s very similar, but different in small and subtle ways?

Under Japan’s copyright law, the reproduction of copyrighted works without permission for AI training purposes is generally considered permissible. Therefore, copyrighted works that are publicly available on the internet can be used for AI training without obtaining permission. However, the datasets created for training cannot be widely distributed to the public unless the copyright holders of those works have explicitly granted permission for such use. This follows the basic principles of copyright.

From the perspective of AI developers, this means that even if they have conducted AI training within the legal limits set by copyright law, they cannot freely distribute the datasets created for that training. AI vendors have no reason to obtain permissions from copyright holders for uses of copyrighted material beyond training purposes, and copyright holders have no incentive to consider any concessions specifically for AI training. This suggests that there is unlikely to be significant demand for completely free and open datasets.

Not only in Japan, but eventually in many jurisdictions, the reproduction of copyrighted works without permission for AI training purposes will likely be recognized. At the same time, making all the data used for such training completely free and open is unlikely to be permitted. However, if all the necessary data and the complete information required to create the dataset are made publicly available, then in principle, equivalent AI training could be conducted.

The current OSAID reflects this understanding and seems to strike a reasonable balance within the framework of global intellectual property rights. I think adding more specific language could be helpful, but so far, I haven’t come up with the right wording.