Report on working group recommendations

Overview

As part of this definitional co-design process, four working groups were convened in January and February to vote on which components should be required for an AI system to be considered “open source” according to the Open Source AI Definition.

To make the working groups as global and representative as possible, we conducted focused outreach to increase inclusion of Black, Indigenous, and other People of Color, particularly women and individuals from the Global South. Each working group also included either one or two creators or advisors on the system under discussion to provide technical expertise. The reports from those groups, including member lists and voting results, are below:

  1. Report from Llama 2 working group
  2. Report from Pythia working group
  3. Report from BLOOM working group
  4. Report from OpenCV working group

Working group members were invited to vote on whether each component is required to study, use, modify, and/or share that AI system. The votes from all working groups were then compiled into a total tally. The compiled votes can be viewed on this spreadsheet. I then created a rubric, based on the mean number of votes per component (µ), to produce a recommendation for each component. The details of that rubric are also in the spreadsheet, in column M.
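
To make the mechanics of a mean-based rubric like this concrete, here is a minimal sketch in Python. The threshold fractions of µ and the vote totals in the example are hypothetical placeholders for illustration only; the actual cut-offs are the ones documented in column M of the spreadsheet.

```python
# Minimal sketch of a mean-based rubric. The threshold fractions of µ below
# are hypothetical placeholders; the real rubric is in column M of the
# linked spreadsheet.
from statistics import mean


def recommend(votes_per_component: dict) -> dict:
    """Bucket each component by comparing its total vote count to the mean µ."""
    mu = mean(votes_per_component.values())
    recommendations = {}
    for component, votes in votes_per_component.items():
        if votes >= 1.5 * mu:        # hypothetical cut-off
            label = "Required"
        elif votes >= 1.25 * mu:     # hypothetical cut-off
            label = "Likely Required"
        elif votes >= 0.75 * mu:     # hypothetical cut-off
            label = "Maybe Required"
        elif votes >= 0.5 * mu:      # hypothetical cut-off
            label = "Likely Not Required"
        else:
            label = "Not Required"
        recommendations[component] = label
    return recommendations


# Example with made-up vote totals, for illustration only.
print(recommend({"Inference code": 42, "Training datasets": 25, "Model card": 9}))
```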

The spreadsheet and recommendations were shared with all working group members via email. They were also shared publicly at last Friday’s townhall and on Tuesday in a Zoom meeting open to all working group members. The next step emerging from the Tuesday meeting was to share the recommendations in this forum for public comment.

Recommendations

The recommendations below respond to the question:

  • Should X component be required for an AI system to be licensed as open?

Based on the number of votes for each component across all working groups, the results are as follows:

Required

  • Training, validation, and testing code
  • Inference code
  • Model architecture
  • Model parameters
  • Supporting libraries & tools*

Likely Required

  • Data preprocessing code

Maybe Required

  • Training datasets
  • Testing datasets
  • Usage documentation
  • Research paper

Likely Not Required

  • Model card
  • Evaluation code
  • Validation datasets
  • Benchmarking datasets
  • All other data documentation

Not Required

  • Data card
  • Evaluation data
  • Evaluation results
  • Model metadata
  • Sample model outputs
  • Technical report
  • Code used to perform inference for benchmark tests

*Includes other libraries or code artifacts that are part of the system, such as tokenizers and hyperparameter search code, if used.


We look forward to reading your thoughts and questions below. Thank you again for being part of this process.

4 Likes

My stance on this issue is the same as the one I stated in Is the definition of "AI system" by the OECD too broad? - #20 by Aspie96. I don’t see why testing code would be required if a program is distributed that doesn’t perform any testing. If one develops a new test for a model they previously released, must they publish it for the model to be open source?

I’m not sure what “data preprocessing code” would imply. Does that include the code of any software used to produce the dataset that had some effect on the data? Images are cropped, compressed, and altered in all sorts of ways, including by software internal to cameras. One would probably not describe that as a preprocessing technique, but the line seems rather blurred to me.

That said, outcome aside, I’d like to point out that of the 4 working groups, 2 refer to models under proprietary licenses, one of which the OSI has already complained about for being described as “open source”.
As I pointed out in Report from BLOOM working group - #2 by Aspie96, Mistral has published models under the Apache 2.0 license. I think a Mistral working group, as well as groups from other related open source AI projects, would be more aligned with OSS values than groups revolving around proprietary systems, whether open access (such as LLaMA and BLOOM) or not (such as ChatGPT).

As an additional note, this represents only a small portion of the landscape at hand. Of the 4 groups, 3 refer to LLMs and the other to a popular library. This misses other kinds of models, both foundation models and others (such as Open Image Denoise), audio models, and so on.

This also seems to focus only on ML systems that produce numerical statistical systems with a (mostly) predetermined structure and a (potentially) large number of parameters.

1 Like

We’re well aware of the limitations of the sample size. These systems were selected to provide a diversity of approaches:

  • Pythia is an open science project, with a permissive license
  • BLOOM is an open science project, with many details released but shared under a restrictive license
  • Llama2 is a commercial project, accompanied by a limited amount of science and released under a restrictive license
  • OpenCV is an open source project, with ML components outside of the generative AI space

We could add more systems, but we’re not sure what analyzing more of them would teach us. And don’t forget, we’re on a time-based release schedule.

What new lessons should we expect from taking more time to analyze Mistral or Open Image Denoise? Do you expect the findings to change dramatically?

PS: we have time to clarify the meaning of each component in the discussions of the next draft.

1 Like

I had a reaction to this that may be similar to that of @Aspie96, though bear in mind I came to this effort fairly late. When I saw that working groups were analyzing restrictively licensed models, I was puzzled: why would the OSI waste its time studying models that (I assumed) stand no chance of meeting the future OSAID (barring some radical change in licensing terms)? And then I have to admit that I wondered whether this was a signal that the OSI might be open to embracing the idea of (partially?) restrictively-licensed or discriminatorily-licensed AI systems. Like @Aspie96, I wondered why the OSI wasn’t looking at something like Mistral’s Apache-2.0 model releases, which at least might plausibly be able to satisfy the future OSAID purely from a licensing standpoint.

2 Likes

I’m actually not convinced that a diversity of approaches is a good thing when the sample is extended in the direction of proprietary assets, and both BLOOM and LLaMA 2 are proprietary.

Mistral would bring the number of (potentially) open source LLMs from 1 to 2. Open Image Denoise was just an example of the many useful models beyond pure foundation models.

I second this.

That said, I feel I have to mention that if open source data is required for a model to be open source, an idea I have consistently opposed and continue to oppose, then not even Pythia is open source.
Yes, they shared the dataset, but the dataset contains entries which are not under open source licenses. So unless sharing something without an open license makes something (else) open source (which I can’t possibly square with the open source tradition), there are 0 open source LLMs actually represented in the working groups under that framework.

In any case, I understand there may not be a way to set up other working groups in time.
I’m not sure having time limits for this was ever a good idea, since getting this wrong could end rather badly, but it is what it is.

That said, I will reach out to, at least, the Mistral people and suggest that they comment on this forum, because I think we do need more feedback from people who develop models which have at least some chance of qualifying as “open source”.

1 Like

Because I consider this exploratory research. This process was designed after we spent months going around in circles, vaguely discussing licenses, models, datasets, data, privacy, copyright, dependencies… That conversation didn’t lead to much. So we designed this process:

  1. Analyze a sample of “AI systems” to identify precisely the required components
  2. For each component of these systems, check their availability and the conditions for use/distribution (the legal documents)
  3. Generalize the findings and complete a checklist for the OSI license committee to evaluate legal documents for AI systems (OSAID “feature complete”)
  4. Get endorsements from major stakeholders (RC1)
  5. Keep refining the OSAID, as it gains support from more stakeholders (v. 1.0)
  6. DONE

We’re finishing stage 1. I wanted to make sure we learned about dependencies from a variety of ML systems. The working hypothesis is that by mixing systems of a heterogeneous nature we’d gain a better understanding of the challenges practitioners face in using, studying, modifying, and sharing systems of any kind.

The objective of the working groups was to identify the components required to exercise the 4 freedoms. At that stage the licenses are irrelevant. In fact, if you look at the distribution of votes across the 4 systems analyzed, they all have very similar results. That’s why I’m reluctant to analyze more systems: it’s a lot of work that I don’t think will teach us anything radically new.

Now that we have the list of components, we can start looking at their licenses. We’re moving into step 2.

Please join the town hall this Friday and we can discuss this in more depth.

1 Like

Personally I don’t think this actually answers @fontana’s question.

Given the availability of both Mistral (potentially open source) and LLaMA (not even close), why have a working group on LLaMA but not on Mistral, rather than the other way around?

Even if studying proprietary models is seen as useful, I still find selecting two proprietary LLMs and one open source one to be unbalanced in the wrong direction.

About this, it might make sense for step 2 to enlarge the scope of the example systems considered, precisely to get a broader idea of the freeness of existing systems. For instance, for code LLMs it might make sense to include Starcoder2/Stack v2 as a “freer” alternative (if you consider all the potential components) to Eleuther and CodeLLama.

To you and @zack: what are you hoping to discover by looking at Mistral that would be so new as to radically change the results of the analysis done so far?

Why? What is so radically different, technically, in the components that make up Llama2, Mistral, and Starcoder that it would require more analysis?

Note that I did not suggest having a new working group on StarCoder2 as part of step (1): I agree with you that we have nothing new to learn in terms of which components it is made of.

I suggest looking at it as part of step (2), because I believe that in terms of availability it is different from all the other “open” LLMs for code I am aware of. In particular, I believe it is the only one, among those surveyed in part (1) at least, with an “open” training dataset.

1 Like

Hello all, I have a process question: how will we decide on which side of the “required” bar the “likely required” / “maybe required” components will fall?

Hi Zack :wave: For version 0.0.6 we drew the line after “likely required,” so “maybe required” and below are not required. If you look at the votes in the spreadsheet linked in this post, there’s a pretty big drop-off between data preprocessing code and the datasets and documentation in the “maybe” category, so this felt like a reasonable place to draw the line.

2 Likes