Why and how to certify Open Source AI

stefano · May 18, 2024, 3:48pm

This is important and deserves its own separate thread. The question of certification was also raised by @fontana in the thread "Is the definition of 'AI system' by the OECD too broad?" (post #11). I think we now have more elements to continue that conversation.

There are many questions that need to be answered and I’d like to hear what people think.

Who exactly needs a certification that an AI system is Open Source AI?
Who is going to use such a certification? Are any of the groups deploying open foundation models today thinking that they could use one? For what purpose?
Who is going to consume the information carried by the certification, why and how?

These are the first ones that come to mind.

zack · May 19, 2024, 4:09pm

Thanks for splitting the thread, it is indeed an important separate discussion.

I think the need for “certifying” an AI system as OSAID compliant or not will emerge primarily from the following situation. The definition is not, and will never be, completely unambiguous. In OSAID 0.0.8 we have the “sufficiently”, “substantially equivalent”, etc. expressions mentioned in the parent post. Even with the proposed changes, we have “high quality equivalent synthetic dataset”. And no matter how hard we try, there will always be room for different interpretations.

As soon as two parties disagree on the OSAID compliance of a system, people will want a judge of sorts. For the OSD, OSI has been such a judge, via the license-review process. (Which was quantitatively easier to manage, because there were way fewer licenses than software products under such licenses. With OSAID we’re potentially looking at one judgment call per system…)

OSI will certainly be the first actor the community will turn to for such judgment calls.

amscott · May 27, 2024, 1:35pm

Thanks also for starting this separate thread.

I suspect the majority of folks involved with AI systems will be able to use the OSAID to “self-certify”, e.g. “I meet the standard so I can use the open source label”. Lots of this will be straightforward and uncontroversial, and should involve minimal bottlenecks or external interference.

I agree with Zack that the most likely scenario where someone wants a form of objective “certification” will be where some kind of arbitration is required. Obvious misalignment with the OSAID will also be easy to identify. The real work will be in nuanced edge cases.

Focussing on the arbitration element rather than some all-purpose certification process in my view is worth seriously considering.

The OSD and the development of OSS licenses have benefitted from over 25 years of community practice and discussion. We have a better and more informed understanding of how the OSD works in practice, with good precedents to point to and expanded guidance alongside the OSD that supports the development of new licenses (https://opensource.org/licenses/review-process).

The OSAID is inevitably going to go through a similar maturation process, with the same discussions, precedent setting, and emergent good practice. There’s merit in thinking about how to best support that as ultimately it will strengthen the OSAID as it has for the OSD.

Having said all of that, I also wonder whether a simple self-certification tool/register would be of use to the community? Something quick which takes about 5 mins to fill in and checks whether a system aligns with the definition or not (based on stating which licenses apply to which components), with potentially some info about versions and locations of components? Beyond making it really easy to check alignment with the OSAID it creates a registry of what good practice looks like and promotes transparency.
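To make the idea concrete, here is a minimal sketch of what such a self-certification check could look like. Everything in it is an assumption for illustration: the `Component` fields, the `self_certify` function, and especially the `ACCEPTED_LICENSES` allow-list are hypothetical and are not taken from the OSAID, its Checklist, or any existing tool. It only compares the licenses a developer states against an allow-list; it does not verify that all required components are present or that the stated licenses are accurate.

```python
from dataclasses import dataclass

# Hypothetical allow-list. Which licenses actually qualify would be decided
# by whoever maintains the definition, not by this sketch.
ACCEPTED_LICENSES = {"Apache-2.0", "MIT", "CC-BY-4.0"}


@dataclass
class Component:
    name: str      # e.g. "model parameters", "training code" (illustrative)
    license: str   # license identifier as stated by the developer
    version: str   # version of the component being declared
    location: str  # URL where the component can be found


def self_certify(components: list[Component]) -> dict:
    """Build a register entry flagging components with a non-accepted license."""
    failing = [c.name for c in components if c.license not in ACCEPTED_LICENSES]
    # An empty declaration is treated as not aligned (an assumption).
    aligned = bool(components) and not failing
    return {"aligned": aligned, "failing_components": failing}


if __name__ == "__main__":
    entry = self_certify([
        Component("model parameters", "Apache-2.0", "1.0", "https://example.org/weights"),
        Component("training data", "Custom-Restricted", "1.0", "https://example.org/data"),
    ])
    print(entry)
```

A register built from entries like this would record which components were declared, under which licenses, and where they live, which is the transparency benefit mentioned above. The nuanced edge cases would still need human arbitration.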

shujisado · May 27, 2024, 2:42pm

In 1998, Perens and ESR declared a marketing campaign to promote free software to Wall Street, and they began using the term “Open Source”. We believed that was the right move and supported their idea. Currently, I do not feel the same excitement for Open Source AI as I did in 1998, and sometimes I still question its significance.

However, for companies like the one I belong to that are involved in the AI business, having the label of Open Source AI would be desirable if possible. Moreover, I believe we have a duty to leave a legacy of free and transparent AI for future generations. For these reasons, the act of certifying Open Source AI has its significance.

However, I am not sure how many entities will be able to obtain such certification.

Mark · May 31, 2024, 7:55pm

I think the issue of certification may also come up in relation to the EU AI Act. After all, this act puts legal weight on the term “open source” and provides open source systems with certain exemptions — exemptions which are apparently attractive enough to induce companies like Meta and Mistral to go all-in on co-opting the term “open source”.

Now, the EU AI Act stipulates a “template” that specifies some forms of disclosure even for “open source” systems, and an “AI Office” that will draw up this template and presumably oversee its enforcement (though by which processes and with what powers, is unclear at this point). As we point out in our FAccT paper:

If this exemption or one like it stays in place, it will have two important effects: (i) attaining open source status becomes highly attractive to any generative AI provider, as it provides a way to escape some of the most onerous requirements of technical documentation and the attendant scientific and legal scrutiny; (ii) an as-yet unspecified template (and the AI Office managing it) will become the focus of intense lobbying efforts from multiple stakeholders (e.g., [12]). Figuring out what constitutes a “sufficiently detailed summary” will literally become a million dollar question.

The ‘million dollar question’ is not even hyperbolic — Meta currently spends 8 million annually in Brussels to lobby on the DSA and EU AI Act. If organisations like OSI and other open source players can play a role in regulation and certification (and especially in ensuring and advocating maximum transparency) it seems this might strengthen the open source ecology.

Alek_Tarkowski · June 10, 2024, 6:32pm

I agree with @Mark that certification of open source AI will be a key issue in the context of the AI Act. I would like to point out that in this context the open source status of the licenses themselves will also be important.

The AI Act uses the term “under a free and open-source license”, which is itself quite confusing (I would expect “free and open”). One can assume that the definition is pretty clear and basically covers OSI-compliant licenses. But I think that one could just as well attempt to argue that responsible AI licensing fits the broad definition that’s included.

Looking at it differently, the issue of responsible AI licensing as a form of open-source licensing remains an open one today. And it needs to be resolved, so that there is clarity on the AI Act’s open source exemptions.

There is a possible simple answer: these licenses are not OSI-compliant, so they are not open source. But I am not sure that it will suffice. That’s because responsible AI licenses are getting some significant traction with developers – when you look at HuggingFace data, for example. So as a legislator I could see the sense of having them exempted from some of the regulation as well.

But to complicate things even further, there are several licenses, like the Llama or Falcon license, that introduce restrictions but are dubbed “open source”. So the European AI Office, as it aims to clarify open-source licensing, might face pressure to accept a definition that includes these various responsible / restrictive licenses.

The conversation on this forum has focused on issues related to systems and their compliance with OSAID. I think that all this points to the need to reach consensus on licenses themselves.

jberkus · June 21, 2024, 5:48pm

One thing I think we need to have a discussion on, and sooner rather than later, is whether the Open Source AI Definition is going to apply to licenses or systems. A lot of the language in the drafts and here on the board is expressed in terms of systems. However, reviewing systems as OSS/proprietary would involve a certification program. Is that where this is headed?

shujisado · June 21, 2024, 10:29pm

Yes. Currently, discussions are progressing towards certifying AI systems in the future.

AI systems are composed of many components. Each of these components needs to be considered separately to determine whether they can be deemed open source. Based on the results of these individual evaluations, we can determine whether the AI system as a whole is open source.

This logic seems reasonable to me. On the other hand, OSAID will not be a definition for licenses, but for systems. Therefore, a mechanism to certify individual AI systems will be necessary. I am concerned about whether the OSI can establish such a mechanism. It would mean a significant expansion of the OSI’s role from reviewing licenses to certifying systems.

stefano · June 26, 2024, 10:20am

Let’s put more thought into this thread, since there was some news last week that may inform the conversation.

The concept of “preferred form to make modifications” in the OSAID leaves some room for interpretation, but so do the OSD and the Free Software Definition; both have been the object of furious debates over how to interpret corner cases (if interested, read the debates that CAL generated, spanning many months).

In practice, and for the examples we’ve seen so far, it’s fairly easy to spot an Open Source AI by generalizing from the examples of Pythia, OLMo, LLM360 and BigScience’s models (if they change their licenses).

Looking at the results of the validation and @amcasari’s comment, the issue is not so much in the ambiguous terms of the Definition or in understanding what the preferred form to make modifications is. The issue seems to lie more in the Checklist below, and in how difficult it is for someone who is not the original developer to find the required components.

This problem is also felt by the Linux Foundation, whose Model Openness Framework the Checklist is based on.

To address it, last week the LF released the MOF Tool (https://mot.isitopen.ai/). This tool allows the original developers of AI systems to add links to the components of their systems and their licenses. @lf_matt_white can explain better how that works internally. This has the potential to become an industry standard, given the size of the LF.

We already adapted the Model Openness Framework to the Open Source AI Definition, so I can imagine that the OSAID compliance could become an overlay of the MOF, displayed on the tool or somewhere else where it matters.

Does anybody want to play with the MOF tool and the OSAID? I’m happy to provide support.

But there could also be other frameworks (like Mozilla’s model), as @amcasari asked in "Concerns and feedback on anchoring on the Model Openness Framework".

That’s to be expected, but we don’t know what the future of AI looks like… Today I can see one reason for a model to show compliance with the Open Source AI Definition:

This! One of the reasons for OSI to start this process is exactly to be able to offer a reference for policy makers: one Definition supported by a large variety of interest groups and maintained by a neutral group. Hopefully we’ll get one in time.

On licenses:

I tend to agree: the OSI License Committee hasn’t been asked if they intend to start evaluating licenses for data and datasets, for documentation and ultimately terms/covenants/agreements/contracts for model parameters. We should ask them now… Does anybody want to open a separate thread?

jberkus · July 11, 2024, 10:58pm

My real concern here is the required resources to certify systems. We’re talking about two orders of magnitude expansion of the labor requirement to certify something, or not. As a parallel, I was involved in certifying systems for TPC benchmarks once upon a time, and that took 10-30 hours per system, and most of those were just 2-10 machines, not the kind of scale we’re dealing with for LLMs. We’d be talking about weeks of work for each system.

This is not something we can do with a volunteer committee like License Review.

shujisado · July 12, 2024, 1:49am

Example Licenses
The GNU GPL, BSD, X Consortium, and Artistic licenses are examples of licenses that we consider conformant to the Open Source Definition. So is the MPL.

The earliest version of the OSD’s 10 clauses included the text above.
At that time, there were many controversies such as whether the Artistic licenses were truly free or whether MPL and QPL should be recognized as open source. I always worried that this free and open community might split.

Currently, the “Open Source AI” we are discussing in this forum has the potential to have a greater economic impact on society at large than “Open Source” did back then. Being recognized as Open Source AI will generate value beyond just a title.
Therefore, there will be bigger and more numerous issues in the approval process than the problem of recognizing MPL or QPL as open source.

The current license review process is more rigorous than before, involving experts in licenses and law. However, as Josh-san argues, it still relies heavily on the goodwill of volunteers. And whether we can bear the responsibility of granting the title of Open Source AI… that might be difficult.

Perhaps, in the process of the review, we will need the cooperation of the machine learning community or consortia to discuss technical issues, and a system to monitor this neutrally may also be necessary. Just thinking about it a little, it seems that a very large organization would be needed. I am concerned about whether OSI can build that.

But if there is anything I can do to help, I am willing to assist.
At this point, what I can do is to spread the word about OSAID within the Japanese ML/LLM development community, and I am putting that into practice.

jberkus · July 12, 2024, 5:05pm

Shujisado-san,

You misunderstand me; I am saying that the OSI will need full-time paid staff to do the certifications. It cannot be done with volunteers.