Initial Report on Definition Validation


We’re aiming to use the Open Source AI Definition (OSAID) to review approximately ten AI systems before releasing RC1 at the end of June.

To this end, we convened four workgroups at the beginning of this year to review an initial group of AI systems self-described as open: BLOOM, Pythia, Llama 2, and OpenCV. These workgroups were composed of system creators and unaffiliated volunteers, and their results were announced in early April.

Reviewed Systems

To continue towards our ten-system review goal, in early May we posted a call for volunteers on this forum. The ask was to help us validate additional AI systems using v.0.0.8 of the OSAID.

That call for volunteers resulted in the following system list and volunteer reviewers. (Previously reviewed systems* are included to give a complete list of AI systems we are analyzing.) Reviewers completed their analysis on this public spreadsheet.

1. Arctic

  1. Jesús M. Gonzalez-Barahona – Universidad Rey Juan Carlos


2. BLOOM*

  1. Danish Contractor – BLOOM Model Gov. Work Group
  2. Jaan Li – University of Tartu, One Fact Foundation

3. Falcon

  1. Casey Valk – Nutanix
  2. Jean-Pierre Lorre – LINAGORA, OpenLLM-France

4. Grok

  1. Victor Lu – independent database consultant
  2. Karsten Wade – Open Community Architects

5. Llama 2*

  1. Davide Testuggine – Meta
  2. Jonathan Torres – Meta
  3. Stefano Zacchiroli – Polytechnic Institute of Paris
  4. Victor Lu – independent database consultant

6. Mistral

  1. Mark Collier – OpenInfra Foundation
  2. Jean-Pierre Lorre – LINAGORA, OpenLLM-France
  3. Cailean Osborne – University of Oxford, Linux Foundation

7. OLMo

  1. Amanda Casari – Google
  2. Abdoulaye Diack – Google

8. OpenCV*

  1. Rasim Sen – Oasis Software Technology Ltd.

9. Phi-2

  1. Seo-Young Isabelle Hwang – Samsung

10. Pythia*

  1. Seo-Young Isabelle Hwang – Samsung
  2. Stella Biderman – EleutherAI
  3. Hailey Schoelkopf – EleutherAI
  4. Aviya Skowron – EleutherAI

11. T5

  1. Jaan Li – University of Tartu, One Fact Foundation

Initial Findings and Obstacles

Unlike the earlier review of BLOOM, Pythia, Llama 2, and OpenCV, in which system creators contributed to the analysis, the review of the seven additional systems did not include creators, leaving wide knowledge gaps and making the work considerably more difficult. As a result, this initial report conveys findings about obstacles in the review process as well as findings about the definition itself:

As the review spreadsheets show, it was often difficult to find documents describing each component (column D) and thus to complete the subsequent analysis (columns F–I).

  • Elusive Documents: Not having system creators in the process meant that reviewers were on their own in searching for the legal documents associated with each component. As one reviewer noted in her feedback email, “There is no ‘one and done’ place to see artifacts, licenses, and terms & conditions attached to each component…” As a result, her system and most others had many blanks in both the document list and the subsequent use/study/modify/share analysis.

  • One Component, Many Artifacts and Documents: A related challenge was that some components are associated with multiple artifacts and multiple documents. Another reviewer noted that, for example, “Source code could be in several repos, and documentation could be in several tech reports or blog posts.”

  • Compounded Components: Part of the above problem stems from the fact that some components in the checklist combine multiple artifacts in a single list item, such as training, validation and testing code; supporting libraries and tools; and information on training methodologies and techniques. This made it difficult to analyze the status of any one artifact. As one reviewer noted, “compounding of different kinds of artifacts together made it challenging to track down the legal document for a specific component…”

  • Compliant? Conformant? Of the eleven required components, six require a legal framework that is “compliant” or “conformant” with the Open Source Definition (OSD), rather than simply an OSI-approved license. Though definitions of conformant (applied to model parameters) and compliant (applied to data information components) were shared during the review process, reviewers requested further guidance on how to review components that are not software.

  • Reverting to the License: Document analysis (columns F–I) currently requires the reviewer to independently assess whether the legal document (column D) guarantees the right to use, study, modify, and share the component. In the interest of simplifying the process, one reviewer suggested that “we could revert to the analysis of the license,” meaning that if a license or other legal document is OSI-approved, conformant, or compliant, then study, use, modification, and sharing are already guaranteed and no further analysis is necessary.
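The suggested shortcut can be sketched as a small decision rule. This is a minimal illustration only; the function name, the status strings, and the `ACCEPTED_STATUSES` set are hypothetical and not part of any OSI tooling or the review spreadsheet:

```python
# Hypothetical sketch of the "revert to the license" suggestion: if a
# component's legal document is OSI-approved (or deemed conformant/compliant),
# treat all four rights as guaranteed and skip the per-right analysis.

ACCEPTED_STATUSES = {"osi-approved", "conformant", "compliant"}
RIGHTS = ("use", "study", "modify", "share")

def analyze_component(legal_document_status: str) -> dict:
    """Return a per-right verdict for one component from its legal document alone."""
    granted = legal_document_status in ACCEPTED_STATUSES
    return {right: granted for right in RIGHTS}

# A component under an OSI-approved license needs no further analysis:
print(analyze_component("osi-approved"))
# A component with an unknown or restrictive document still needs manual review:
print(analyze_component("unknown"))
```

The point of the suggestion is exactly this collapse: the four columns F–I would become a single function of column D.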

Help Us Fill in the Blanks

Given the above, we’re making a call to both AI system creators and unaffiliated volunteers to help us fill in the gaps in the system reviews started by our valiant reviewers.

  • AI System Creators: If you see your system in the list above and see blanks in your system’s spreadsheet, please comment or DM me to help us fill them in. If you don’t see your system listed above but would like to put it through the review process, as the LLM360 team did recently, please also let us know.

This could even be a permanent solution for the elusive document challenge. As one reviewer suggested, “In the cases when there is cooperation by the organization publishing the model, it would be great if they can fill in the links to the different artifacts…” In that case, “The review is just checking their licenses and verify[ing] that they really are available and correspond to the artifact category.”

  • Independent Volunteers: We’d also love the help of unaffiliated folks. If you did not create an AI model but are very knowledgeable about it, please speak up and help us fill in the blanks. Raise your hand by commenting or DMing me.

Let’s see how many of the above questions become clearer – or murkier – once those most familiar with the systems identify the legal documents that describe each component. More to come…


I’m curious about “sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.”

Does the validation check for this? Maybe it’s already been established that given the required elements on the sheet, a skilled person could accomplish this. I’d be interested to learn about some instances of skilled persons undertaking this.

That’s the idea: the Preferred form to make modifications lists the basic principles that are unlikely to change in the future, while the Checklist below provides a list of components required to comply with the definition of preferred form.

The validation phase is designed to confirm this hypothesis:

The availability of the components:

  • Training methodologies and techniques
  • Training data scope and characteristics
  • Training data provenance (including how data was obtained and selected)
  • Training data labeling procedures, if used
  • Training data cleaning methodology
  • and the required Code components

is “sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.”
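The hypothesis can be stated as a simple completeness check: a system clears the “sufficiently detailed information” bar only if every listed component is available. A minimal sketch, assuming the component names from the checklist above (the function and list names are illustrative, not OSI tooling):

```python
# Hypothetical sketch of the validation hypothesis: availability of all the
# data-information components (plus required code) is taken to constitute
# "sufficiently detailed information about the data used to train the system."

REQUIRED_DATA_COMPONENTS = [
    "training methodologies and techniques",
    "training data scope and characteristics",
    "training data provenance",
    "training data labeling procedures",  # if labeling was used
    "training data cleaning methodology",
    "required code components",
]

def sufficiently_detailed(available: set) -> bool:
    """True only if every required component is available for the system."""
    return all(component in available for component in REQUIRED_DATA_COMPONENTS)
```

The validation phase then amounts to testing whether systems passing this check really can be recreated by a skilled person.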

Mer’s report highlights the difficulties encountered by the volunteer reviewers; the “Elusive Documents” bullet point is especially telling.

If the volunteer experts in the working group can’t find the components, then we can’t really evaluate the systems and properly test the hypothesis. We need a better way.

Maybe we need to ask the developers of the AI systems to fill in a survey, so they’d provide all the details themselves. This may work, and I can see people leading Open Source-friendly projects like @Stella_Biderman @Danish_Contractor and @vamiller filling in such a form. But Meta, Mistral, xAI and the like? They won’t, and will likely continue to call their systems “open source” because they want to get out of the obligations of the EU AI Act or gain some other market advantage. Maybe that won’t be a big issue, because eventually there’ll be enough public pressure to censure those who abuse the term Open Source AI, just as there is public pressure now against those who abuse the term Open Source for software.

What do you think?


I think it would strengthen the definition if we had more information on instances of people taking all these open model-building components, modifying and rebuilding them – what was involved, here’s the end product, etc.

My thought all throughout has been that if OSI implies the promises of open source that we expect from software don’t carry through to “open source AI,” we’ll be watering down the meaning of open source.

I’m not a kernel developer, but I have patched, compiled and run Linux kernels. It’s not that I need to be able to reassemble a model myself, but I’d love to know that it can happen and is happening with qualifying models – even a toy model example would be useful.

As for the groups claiming open source, the best counterargument would be that, training data aside, the model weights they make available come with use restrictions, so they can’t even begin to be considered open source. OSI might have a chat with each journalist who repeats (or, in many cases, introduces) these claims.


Thank you, Mer, and everyone for posting this thorough summary of our efforts.

For community transparency, I will share here our review of the review notes and feedback, including points highlighted above:

Hello Stef and Mer,

Thank you again for allowing our review team a little extra time to understand the nuances of what we were working on and the work that has been refined by the working group to create the latest version. We have submitted our notes in the shared team document.

We wanted to provide some feedback about the review process as it exists now. We’re scoping this specifically to Abdoulaye’s and my review of OLMo from AI2. We know we have other threads to catch up on with you, but wanted to give you notes about our experience sooner than later.

Things that worked well for us

  • This model was a recent launch, so blogs consolidating information and their source repositories were all mostly up to date and easier to link across sources.
  • It was VERY helpful to have a shared spreadsheet with other reviewers’ notes and guidance. This helped us understand where people had come from before in their work, and where others might have similar questions.
  • We really appreciate the email thread between Mer and the other reviewers. It allowed us to ask questions we weren’t sure about and to see what general guidance was being given to the group. It also kept the conversation “in the room” with the people who had volunteered, as we sorted out our understanding of the work, and didn’t open these questions up to scrutiny from non-volunteers.

Things that were challenges

  • The current component/system framework is not easily contained or bounded by current platforms or technology frameworks. This presents a challenge for reviewers AND developers to track down all the parts AND licenses surrounding a system.
    • tl;dr > There is no “one and done” place to see artifacts, licenses, and terms & conditions attached to each component, or to understand what could apply to the system as a whole.
  • Even with the reference paper, the current component framework did not make clear to us which type of artifact should fall under each component. For example, in our notes, you’ll see questions like “Is this talking about the publication, the white paper, the source code, ….”
    • tl;dr > This potential compounding of different kinds of artifacts made it challenging to track down the legal document for a specific component, when we couldn’t determine which one applied or whether there was any at all (in some cases white papers were listed without a copyright notice in either the directory or the paper).

Where we felt blocked

  • We did not feel equipped to decide whether a copyright license attached to anything other than software (and even to software!) was “compliant” or “conformant” with the OSD.
    • tl;dr > We would appreciate guidance from the OSI’s licensing committee, maybe in partnership with groups like Creative Commons, to develop an initial list for reviewers for artifacts that are not software.

amanda casari
Google Open Source OSS+AI Lead


Here’s where we are with system analysis, given our incomplete legal documentation. If you can help us fill in the blanks with more complete system information, please let us know (comments or DM). Thanks!

Thanks for calling this out, I hadn’t read it yet. I like the class-based approach. Do you know whether this framework is intended to be part of the official open source AI definition?

@jasonbrooks the Model Openness Framework is indeed referenced in the draft Open Source AI Definition, the Checklist is based on it and the paper is linked from the “note” in the draft.

And no, the Open Source AI Definition will be binary, not a range of openness.

Thanks to @nick for pointing us to the LLM openness list. Based on this information, we are now categorizing Mistral (Mixtral 8x7B) as not in alignment with the OSAID because its data pre-processing code is not released under an OSI-approved license.

Again, if we are missing information on this or other systems, please let us know.

A post was merged into an existing topic: Open Source AI needs to require data to be viable

I’d like to audit Silo AI’s Viking model: Silo AI releases the first checkpoints of Viking, an open LLM for all Nordic languages, English and programming languages.

They claim it’s Open Source under the Apache 2.0 license, so I’d like to do a thorough analysis about the actual openness of the model. How can I get involved?

Note: the latest OSI email said I should volunteer in this thread

Hi @merlijn-sebrechts. Thanks for your interest. Right now our process is to have each system reviewed by at least 1 person who is unaffiliated with the system (neither a creator nor an advisor). If you are unaffiliated with the Viking model, I can set you up with a review sheet in our public review deck to get started.

Although at least 1 independent reviewer is required, we are finding that creator collaboration is usually also necessary. As this post describes, it has been challenging to find all required documentation without creator involvement. A collaboration between system creators and independent reviewer(s) seems to be the most reliable recipe for completing a system review.

Do you have a contact at Silo AI who would be able to help you find the documents you need to complete the review?

I’m not affiliated with Silo AI. Feel free to set up a review sheet for me.

I don’t have a contact, but I can message them and see if I get a response.

Okay, I’ve set up the review sheet for Viking here: OSI: AI Systems Review Workgroups - Google Sheets. Please DM me your email address so I can give you edit access.

If you have questions during the process, you can leave comments in the sheet and I will respond there directly (see the Arctic sheet for an example). Thanks for volunteering!

Are MLOps components considered to be included in this topic?

Hi @anatta8538, I think that’s an interesting question. We are evaluating solely the components described in the Model Openness Framework, which covers how these components are developed and, to a certain extent, how they are deployed. I believe the scope would be much wider if we were to consider MLOps as a whole.


Will CLIP, which is an LMM, be consistent with this OSAID?