Share your thoughts about draft v0.0.9

We are excited to share the draft v0.0.9 of the Open Source AI Definition with you. This draft represents the culmination of our open process, but we recognize that there may be differing perspectives on its content.

Your Input Matters

As always, we invite you to review the draft and share your thoughts, especially any disagreements or concerns you may have. Your feedback is crucial in ensuring that our definition is as comprehensive, accurate, and fair as possible.

How to Share Your Feedback

  • Please be specific: If you disagree with any part of the draft, let us know which section you’re referring to and provide a brief explanation of your reasoning.
  • Offer alternatives: If you believe something can be improved, feel free to suggest alternative wording or approaches.

Community Guidelines

We encourage a constructive and respectful dialogue. As always, please adhere to our Code of Conduct. Let’s ensure that this discussion remains civil and productive.

What’s Next

We are constantly collecting and reviewing feedback from a wide range of stakeholders. We will make necessary revisions toward a stable version of the Open Source AI Definition, which is expected to be announced at All Things Open in October.

Thank you for your time and input. We look forward to hearing your thoughts!

This is excellent. I thought there wouldn’t be any major changes in this version 0.0.9, but I am pleasantly surprised.

I believe it was a wise decision to separate the checklist document from OSAID. This makes the intended thinking behind OSAID easier to understand.
Additionally, splitting the model into “code” and “weights” is a very good idea. At least in Japan, the “code” portion as defined in OSAID 0.0.9 clearly falls under copyright, so it makes sense to separate it from the “weights” portion.

I also noticed, when looking at the checklist, that there have been revisions to accommodate those who advocate for requiring a complete dataset. While I cannot agree with the argument that datasets should be mandatory, I understand the sentiment behind the request. I don’t think there is a significant change in substantive meaning compared to version 0.0.8, but it does send a message that datasets are also important.

2 Likes

I think overall it’s clean and easily understood. While I know it’s not the intent of the OSI to cover all aspects of the use of AI in this definition, there are two things I’d like to offer for consideration.

  1. Technically speaking, since the USPTO and many international organizations have ruled that purely AI-generated content can be used without license or royalty, and can neither be trademarked nor have copyright enforced, how would the OSI address AI-generated content created by closed-source and open-source models alike?

  2. This line at the end: “The Open Source AI Definition does not take any stance as to whether model parameters require a license, or any other legal instruments, and whether they can be legally controlled by any such instruments once disclosed and shared.”

My concern is that the parameters themselves fall under the “Study” and “Modify” components of the Open Source AI Definition. There should be some clarifying language about the difference between a license that might apply to a parameter disclosed as part of an open source model vs. the open source model itself. That is, is there a transitive application of the Open Source AI Definition that covers all aspects of the model described in the document? Where is the cut point?

There has been a lot of media coverage regarding the release of version 0.0.9, which is encouraging. However, one thing caught my attention in the final part of the article linked above. I quote it below:

She adds that OSI is planning some sort of enforcement mechanism, which will flag models that are described as open source but do not meet its definition. It also plans to release a list of AI models that do meet the new definition. Though none are confirmed, the handful of models that Bdeir told MIT Technology Review are expected to land on the list are relatively small names, including Pythia by Eleuther, OLMo by Ai2, and models by the open-source collective LLM360.

It mentions that an enforcement mechanism is being planned to flag false open source AI, but I am not enthusiastic about this. I believe OSI’s statement regarding Llama was effective, but I do not wish to see labels being regularly applied. It seems to be different from OSI’s traditional stance, so I hope this is a misunderstanding by the journalist.

Another point: while I understand that Pythia, OLMo, and LLM360 comply with versions up to OSAID 0.0.8, if models are to be announced as meeting the standard, it would be wise to follow a process confirming that they meet the criteria of the latest definition at that time.

2 Likes

It mentions that an enforcement mechanism is being planned to flag false open source AI, but I am not enthusiastic about this. I believe OSI’s statement regarding Llama was effective, but I do not wish to see labels being regularly applied. It seems to be different from OSI’s traditional stance, so I hope this is a misunderstanding by the journalist.

Hi @shujisado-san, I believe you are right. The OSI is not currently planning to certify Open Source AI systems.

1 Like

Regarding this new version, we at LINAGORA have the following comments:

Indeed there is no significant difference between v0.0.9 and v0.0.8. The idea is still not to explicitly require training data to be published, but to require sufficient information to be provided so that an equivalent system may be recreated.
Although this choice does not correspond to our initial position, we understand that it is a path of compromise.

However, we note that the notion of “equivalent system” is not specified, which is open to different interpretations. We therefore propose adding a sentence to clarify this notion. Such a sentence may be “two systems are said to be equivalent if they produce the same outputs given identical inputs”.

Another point is about the requirement of the “supporting libraries like tokenizers and hyperparameters search code” of the code bullet in the “Preferred form to make modifications to machine-learning systems” part. We think that tokenizers are very specific to LLM systems and do not apply to other generative AIs, so we propose withdrawing this specific reference and keeping only “supporting libraries and hyperparameters search code”.

Given these considerations, LINAGORA is prepared to support the proposed definition.

2 Likes

I agree with the idea that there should be supplementary wording to explain the term “equivalent system.” However, in the case of general LLMs, even completely identical systems cannot guarantee that the same output will be generated for the same input, as the sketch below illustrates. We may need to consider other wording. Moreover, the wording that reinforces “equivalent system” would be better placed on the checklist side rather than in the OSAID itself. The final determination of whether an AI system is open source is, in my view, the task of the checklist.
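A minimal sketch of this point, assuming the Hugging Face transformers API and the publicly available gpt2 checkpoint (both used here purely for illustration, not as part of any proposal):

```python
# Illustration: the same weights and the same prompt can yield different
# outputs when sampling is enabled, so "identical outputs for identical
# inputs" is too strict a test of whether two systems are equivalent.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Open Source AI is", return_tensors="pt")

# Two sampled generations from the very same model usually differ.
for _ in range(2):
    out = model.generate(**inputs, do_sample=True, max_new_tokens=20)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

# Greedy decoding is deterministic on one machine, but floating-point
# differences across hardware or library versions can still change it.
out = model.generate(**inputs, do_sample=False, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```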

2 Likes

I think it is a great starting point. To assess transparency, I believe it is important to also include the detailed technical aspects of model training, such as hardware specifications, training time and carbon footprint (if available). Sharing this information would benefit the community by promoting reproducibility, improving accessibility, and enabling more efficient collaboration. Transparency around resource requirements also helps practitioners estimate the computational cost of replicating or adapting models, encouraging more responsible and optimized use of hardware.
In this sense, knowing the hardware setup might allow researchers to recreate the training environment and thus better understand the computational power required; it would also help developers optimize models for more efficient training on available resources.
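As a rough illustration of what sharing this information could look like, here is a hypothetical training-transparency record; all field names and values below are illustrative placeholders, not an established or required schema:

```python
# Hypothetical training-transparency record; every field name and value
# is an illustrative placeholder, not a standardized schema.
training_report = {
    "hardware": {
        "accelerators": "8x NVIDIA A100 80GB",  # example setup
        "nodes": 1,
    },
    "training_time_hours": 1200,    # wall-clock time, example value
    "energy_consumed_kwh": 35000,   # if measured
    "co2_eq_emissions_kg": 14000,   # carbon footprint, if available
}

# Publishing a record like this alongside the weights would let
# practitioners estimate the cost of replicating or adapting the model
# before committing resources.
print(training_report)
```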