Draft v.0.0.8 of the Open Source AI Definition is available for comments

We reached a milestone with draft v.0.0.8: The Open Source AI Definition is feature complete.

This version of the document has all the elements required to evaluate whether an AI system is Open Source. It also marks the debut of a companion document collecting answers to the most frequently asked questions.

Going forward, we expect plenty of debate around word choices and clarifications to the FAQ, rather than major changes to the structure of the document.

Changelog

  • incorporated feedback from legal review in Gothenburg and 0.0.7
    • transformed Data transparency to Data information following feedback from the
    • moved the Out of scope section into a FAQ document
    • added a mention of “frictionless” to the preamble
    • moved the definition of “preferred form to make modifications” for ML above the checklist
  • updated language to follow the latest version of the Model Openness Framework
  • added the legal requirements for optional components
  • added the first incarnation of the FAQ

Known issues

See “How to describe the acceptable terms to receive documentation?”

Also, you’ll notice that the recipient of the freedoms is left implicit: this is because the intention is to leave space for the subject of a (bad) AI-based decision to have the same rights as developers and users. Happy to make this a separate thread and a focus of future co-design workshops.

Next steps

The next steps before the next version are:

  • Widen the call for reviewers in the next couple of weeks
  • Test the Definition with more AI systems (OLMo, Phi, Mistral, etc.)
  • Run a review workshop at PyCon US

For convenience, this is the link to the 0.0.8 draft:

After a first read, it looks really good! Congrats!

The whole “Preferred form to make modifications to machine-learning systems” section looks more like a wishlist than part of the definition.

It introduces a whole bag of unprecedented requirements. Some of them apply only to certain kinds of AI systems; the rest would apply to all kinds of software, but have never been part of the OSD, or a requirement for software to be open source. At most, they’re needed for software to be good.

The impression is that everything one might find desirable in an AI system, other than good performance, has been included in the document.
This is in striking contrast with the tradition of the free software (and open source) community of primarily opposing that which is deliberately done to restrict others.

Instead, a massive amount of additional work and a much more rigid, restrictive workflow are effectively required of those developing open source AI systems, compared to everyone else.

I’m not sure I understand your criticism. Starting from here:

Yes, the definition is supposed to be a sort of wishlist in the sense that it contains the list of AI components that one “wishes” (requires, rather) to be made available.

The header specifically mentions “machine learning systems” because it’s ML systems that put the OSD under stress. I’m not aware of other AI systems that aren’t served by the “classic” Open Source Definition.

I read your comments on HackMD; I believe you’re referring to Blender and the Intel Open Image Denoise… I think OID should be treated as a library dependency: it’s not a requirement for Blender to ship the whole shebang needed to rebuild OID themselves. Let’s analyze OID and see if/where the Open Source AI Definition draft fails.

But almost nothing one should wish about any non-AI program is part of the standard OSD.
I wish programs had good documentation, written with proper grammar as well.
I wish the commit history were kept well.
One may wish to have access to all the intermediary assets that lead to the creation of a program (diagrams, etc…) as well as to know the thinking process of the author.
But the only thing that is required, the whole point of open source, is that the author doesn’t restrict others’ software freedom through legal and technical means, and that the program is shared in the form that is always available to the author, which is about as easy to share as any other form.
Releasing an open source program is generally no harder, no more cumbersome and requires little to no additional work compared to releasing it as proprietary.

Thank you, and apologies for my oversight. I should have compared this draft more carefully to previous versions.
This is just something I’m slightly confused about: is this supposed to refer to all kinds of ML systems or only “blackboxy” ones?
I know neural networks are (deservedly) the most prominent approach in ML right now, but I wonder if these requirements are intended to apply to “clearboxes” too (such as decision trees), which may very well gain popularity again (given the desire, in certain areas, for more decipherable systems).

Me neither, which is why I’d rather define any other AI system as “open source software” than “open source AI”.

Not really, in this instance, although that could be one example.
Under the current draft, releasing, let’s say, an “open source AI” LLM would require much more work than releasing a proprietary LLM, in the form of writing documentation, sharing rather large files, preserving information, restricting the way you train it, etc.
For example, if you train a system interactively and you happen to bodge a bit (tweaking the process with some additional lines of code), you’d arguably have to keep track of that too, since that code contributed to the training process.
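
To make that concrete, here is a hypothetical sketch (every name in it is invented; it stands in for a real run): the one-off tweak below was typed in mid-session and never committed anywhere, yet it shaped the final weights, so a strict reading of the draft would require preserving and sharing it.

```python
# Hypothetical toy training run -- every name here is invented.
# The point: the one-off "bodge" below is code that contributed to
# training, so a strict reading of the draft would require keeping it.

class ToyModel:
    def __init__(self):
        self.weight = 0.0

    def update(self, x, y, lr):
        # One gradient step on the squared error (weight * x - y)^2.
        grad = 2 * (self.weight * x - y) * x
        self.weight -= lr * grad

model = ToyModel()
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]

for epoch in range(100):
    for x, y in data:
        # The bodge: a line typed interactively after eyeballing the
        # loss, never committed -- yet it shaped the final weights.
        lr = 0.001 if epoch > 50 else 0.01
        model.update(x, y, lr)
```

Lose that one line and nobody can reproduce the released weights exactly; tracking every such line is exactly the extra burden I’m describing.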

It’s also vague to the point that, personally, I’d never feel comfortable describing a deep learning model as “open source AI” under this definition.
I would, of course, not refrain from describing training and testing/inference programs (which, regardless of the model, are just code) as “open source software”, with reference to the OSD.

I see the issue now and this should go in the FAQ, too. The “classic” Open Source Definition is applied to licenses, not to the software. We’ve been working under the axiom that if a program is shipped with a license approved by the OSI, then the software is considered Open Source. In the software space that’s generally understood and mostly works fine (although it’s challenged at times).

For machine learning systems the OSI can’t simply review licenses, as the concept of the “program” in this case is not just the source/binary code. Through the co-design process of the Open Source AI Definition we learned that to use, study, share and modify an ML system one needs a complex combination of multiple components, each following diverse legal regimes (not just the usual copyright and patents). Therefore we must describe in more detail what is required to grant users the agency and control expected.

There is vagueness with the Open Source Definition, too. And the interpretation of the OSD has evolved over time.

Some vagueness is fine, but there must be clarity about the intentions, the preamble and the basic principles we want to achieve: those shouldn’t change. The definition of preferred form to make modifications lists principles about data information, code and model, and provides examples of things required to comply. I expect those examples to be used and refined by evaluators in the future… starting now (that’s the reason for the validation phase).

A frame of reference that may be useful to evaluate whether an ML system grants you the necessary freedoms is to ask yourself (a toy sketch of this check follows the list):

  • Do I have the preferred form to make modifications to the system?
  • Do I have the elements necessary for me to fork the system and build something new on top of it?
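
As a very rough illustration, here is how an evaluator might operationalize those two questions against the component categories the definition lists (data information, code, model). To be clear: the categories come from the draft, while the field names and pass/fail logic below are invented for this sketch and are not OSI’s actual checklist.

```python
# Toy evaluation sheet. The three categories (data information, code,
# model) come from the draft; everything else here is invented.

from dataclasses import dataclass

@dataclass
class MLSystemRelease:
    data_information: bool  # enough detail about the training data
    training_code: bool     # code used to train the system
    inference_code: bool    # code used to run the system
    model_parameters: bool  # weights and other learned state

    def has_preferred_form(self) -> bool:
        """Q1: do I have the preferred form to make modifications?"""
        return all([self.data_information, self.training_code,
                    self.inference_code, self.model_parameters])

    def can_fork(self) -> bool:
        """Q2: can I fork the system and build something new on it?"""
        # In this toy model the two questions collapse into the same
        # check; real evaluations would weigh each component's terms.
        return self.has_preferred_form()

release = MLSystemRelease(data_information=True, training_code=True,
                          inference_code=True, model_parameters=False)
print(release.has_preferred_form())  # False: the weights are missing
```

The point of the sketch is only that the questions become answerable once the required components are enumerated explicitly; refining that enumeration is what the validation phase is for.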

We have time to add clarity to the definition of “preferred form to make modifications”.

Yes, but the amount of ambiguity and drift has been remarkably small, compared to many other ideas in the CS or legal field.
While a gray area exists, it’s at least possible (and common) to write a program that will qualify as open source according to practically anyone (operating systems, governments, companies, etc…) applying a (current or past) standard interpretation of the OSD.
I believe this is extremely beneficial.

I don’t see how this can happen with the open source AI definition and, if it doesn’t, I wonder how that will reflect back on how the standard OSD is perceived.
This is especially true considering that the OSI is in a very different position now than when it (very successfully) promoted the OSD and the phrase “open source” in a world that was discovering it.
Now the phrase is commonly used in the field of software, in and out of AI (including deep learning), to indicate, most of the time, exactly what OSI intended.

What is still unclear to me, at this point, is what even counts as a “form of the system”.

In my mind, a “form” of X has always been a way in which X can be represented. A program can be in binary form, in source code form, in minified form, and it’s still legally and conceptually the same asset, much like an image is still considered the same image if it’s converted to another format (including a lossy one).

Clearly, those who have supported the inclusion of datasets, various forms of documentation and even past checkpoints as a “form” of the system must mean something else entirely, however, as those assets are not different representations of the model.
I haven’t been the only one to raise objections, so maybe I’m not the only one who still doesn’t grasp this other concept of “form”, or who doesn’t see how it can be squared with any existing idea or practice in open source.

It would help me if I could understand what is even meant by a “form” of the program. “Preferred to make modifications” has a plainer meaning: I can think of many things I’d prefer to have if I plan to modify a computer program or a trained ML system.

I’ve said from the beginning that AI is a whole new thing and that comparing AI to software would not help us move forward. It’s new territory; it’s like going from a 2D to a 3D space.

Start from there: The co-design process revealed that many things are required to modify a trained ML system. Those many things are the form.

Yesterday, Matt White (PyTorch, LF AI, etc.) spoke about open AI (not the company) at an event organized by the Linux Foundation Japan. He also spoke about the Model Openness Framework referenced in the OSAID.

His talk reinforced my belief that OSAID might work. Let’s keep going forward with this.

We are at that most beautiful time of the year… the time to fine-tune the text of the definition. So do stop on by the HackMD page and suggest alternate or additional wording to increase the clarity and usability of the OSAID.
