Open Source AI needs to require data to be viable

I’ve reread your paper and I don’t think we’re too far apart. Your premise is that we need to have a binary categorization so that one will know whether a system is entitled to the benefits given to open source systems under the EU AI Act or not. You propose developing a judgment based on a cumulative score. I assume that your concept is that there is some minimum threshold for all metrics. In other words, no one would be able to game the system by being completely closed on one measure but offset it by being super open on another. The OSI’s undertaking is to figure out what the relevant metrics are and what the minimum threshold for each metric is for a system to be considered “open source.” The difference is the OSI is describing that minimum threshold in words, not implementing it in an algorithm.
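To make the distinction concrete, here is a minimal sketch of a cumulative score next to a per-metric minimum threshold. The metric names, the 0-to-1 scale, and the threshold values are all hypothetical and chosen only for illustration; they come from neither your paper nor the OSI draft.

```python
# Hypothetical metrics scored on a 0.0-1.0 scale; the values are illustrative only.
MIN_THRESHOLDS = {"code": 0.5, "data": 0.5, "weights": 0.5, "license": 0.5}


def cumulative_score(scores):
    """Sum of all metric scores; a high total can hide one fully closed metric."""
    return sum(scores.values())


def passes_every_threshold(scores, thresholds=MIN_THRESHOLDS):
    """True only if each metric meets its own minimum (a missing metric counts as 0)."""
    return all(scores.get(name, 0.0) >= minimum
               for name, minimum in thresholds.items())


# Completely closed on data, fully open everywhere else.
gamed = {"code": 1.0, "data": 0.0, "weights": 1.0, "license": 1.0}
# Moderately open on every metric.
balanced = {"code": 0.75, "data": 0.75, "weights": 0.75, "license": 0.75}

for name, system in (("gamed", gamed), ("balanced", balanced)):
    print(name, cumulative_score(system), passes_every_threshold(system))
```

Both hypothetical systems reach the same cumulative total, but only the balanced one clears every per-metric minimum, which is the property I assume your scoring is meant to guarantee.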

But I still have some areas of concern about your paper. First, you seem to be saying that RAIL licenses are open source licenses. They are not; they impose field-of-use restrictions.

This is somewhat related to my second disagreement. I don’t follow your chart for how you decide whether something is “open,” “partial,” or “closed.” When it comes to licenses, there is no such thing as “partial” - it complies with the Open Source Definition or it does not, there is no in-between. It instead appears that you are using the word “open” to mean “publicly available,” which are two very different things. For example, under the “RL weights” column, the tool tips say things like “full model weights made available,” “finetuned model available for download,” “instruct version of the model made available but no information on fine-tuning procedure provided.” But this doesn’t inform anyone whether they can use, reproduce, modify or distribute the RL weights. Being “open” assures these rights, not simply that you have access to it. If “open” only means “I can see it,” then every published book would be “open.”

You also don’t seem to have evaluated what the rights are for each of the components. You have only one “license” column, not columns for both whether a component is publicly available and whether there are assurances (licenses, most typically) ensuring that the components can be used by others. While your article is critical of companies that are putting only their models under an open source license, you are encouraging that by having only one license column across five components - “open code,” “LLM data,” “LLM weights,” “RL data,” and “RL weights.” What component does the license cover? As spot points out:

The OSI approach is to require that every piece of the puzzle must be available and it must be under a license, promise, or covenant that ensures others can use, copy, distribute and modify each one of the necessary components. That does not appear to be something you are considering or requiring in your proposal for something to be considered “open.”
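As a rough sketch of that requirement, and not an official OSI check, the test below treats a system as open only if every component is both available and covered by all four rights. The component names are taken from your chart's columns; the `Component` class and the example systems are hypothetical.

```python
from dataclasses import dataclass

# The four rights named above, and the five component columns from the chart.
REQUIRED_RIGHTS = frozenset({"use", "copy", "distribute", "modify"})
COMPONENTS = ("open code", "LLM data", "LLM weights", "RL data", "RL weights")


@dataclass
class Component:
    """Hypothetical record: is the component published, and which rights are granted?"""
    available: bool = False
    granted_rights: frozenset = frozenset()


def component_is_open(component):
    """Availability alone is not enough; all required rights must be granted too."""
    return component.available and REQUIRED_RIGHTS <= component.granted_rights


def system_is_open(system):
    """Every component must pass; a missing component counts as unavailable."""
    return all(component_is_open(system.get(name, Component()))
               for name in COMPONENTS)


# Everything is downloadable, but only the model weights carry a license.
weights_only = {name: Component(available=True) for name in COMPONENTS}
weights_only["LLM weights"] = Component(True, REQUIRED_RIGHTS)

# Every component is downloadable and licensed.
fully_licensed = {name: Component(True, REQUIRED_RIGHTS) for name in COMPONENTS}

print(system_is_open(weights_only))
print(system_is_open(fully_licensed))
```

A single "license" column cannot express the first case, where everything is visible but only one component can actually be used, copied, distributed, and modified.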

System no 2 model makes the most sense to me. I see this as a similar exercise to bibliographies for books - you provide a reference to the data used to create the book but you don’t have to copy the book. In bibliographies, for online content, you provide a URL and list the date when the data was pulled.

It seems very inefficient to try and centralize all the data that was used to train a model. Should the original source data change, then the link can provide access to the most current artifact (point highlighted in this response).
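A bibliography-style entry for training data could be as small as the sketch below. The field names, the example title, and the `example.org` URL are placeholders I made up for illustration, not part of any proposal in this thread.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataReference:
    """Hypothetical citation for one data source: what it is, where, and when pulled."""
    title: str
    url: str
    retrieved: date


def format_reference(ref):
    """Render the entry the way a bibliography cites online content."""
    return f"{ref.title}. {ref.url} (retrieved {ref.retrieved.isoformat()})"


# Placeholder source; a real manifest would list every source used in training.
example = DataReference(
    title="Example public archive",
    url="https://example.org/archive",
    retrieved=date(2024, 6, 10),
)

print(format_reference(example))
```

The entry points to the source instead of copying it, and the retrieval date records which version of the data was pulled.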

More book content is published annually than the largest current models cover. And most current models haven’t been trained on all the publicly available content (I tested their knowledge of the National Archives and they all failed). Trying to replicate these repositories of data is neither realistic nor necessary.

@stellaathena thanks for raising this point, I have been having similar thoughts when reading about Dolma.

The Dolma page on Huggingface states:

We are releasing this dataset under the terms of ODC-BY.

I have been struggling to understand what they mean by “dataset” that they are licensing - sounds like they are aware that the underlying data is not “open” in a typical sense (free/open licensing, Public Domain, etc.) - they acknowledge that

you are also bound any license agreements and terms of use of the original data sources.

So it’s unclear to me what they are actually licensing. If I understand correctly, Dolma is not much more than an aggregate of various existing data sources. Or am I missing something?

In the abovementioned blog post outlining the license shift, the ODC-BY license is presented as the only license that governs use of the dataset, which is incorrect and misleading.

It’s a good example of the lack of clarity about not just what is meant when a dataset is described as open, but also what the proper licensing practices for open datasets are.

“I don’t want my ‘Open Source’ AI to be correlated with their ‘Open Source’ AI”

To be clear, I am not the right person to make this point,
I know.

Basically, think of declaring an “Open Source” cake mix. I know we aren’t talking about a cake mix, yet, like one, I’m not gonna consider it valid without the ingredients.
Are we going to redefine “Open Source cake mixes” just so we can sell them in the “Open Source” store?
NO!!

It’s been pointed out to me that all you need to disclose is the full code,
but only describe the ingredients?!
That’s crazy!

Although I somewhat see where it may come from, we should never need to redefine “Open Source”. Regulatory detail is a must.
Might as well coin a different term other than “Open Source”. That way my REAL “Open Source” isn’t overshadowed by your FAKE one, just so we can be considered in the same grading table.

I could obviously also go into the repercussions of keeping secret where the data originates, as the data defines the intelligence.

Not every aspect of a data-set that is important can be appropriately described or adequately interpreted in any form other than the data itself.

What if there’s a problem down the line that is caused by data missing from the initial data set?
Nobody would know, because the only people who know what the data was to begin with are the ones who kept it from general knowledge.

Nothing about this sounds open.
Yes, you release the weights, but they rely on the data.
Meaning, if you want to cross-check the project, it will never be feasible without the data. In turn, there is no way to guarantee that the data stated to have been used actually was.

If you want to keep data away from the general public,
Then don’t call it open source.

It’s encouraging to see the community come together to hash out our experiences, opinions, concerns, and ideas to move forward.

However, at 45 messages, this thread is now very challenging to walk through to understand current differences, points of contention, paths for mitigation or mediation, or where to analyze holistically for individual or organizational consensus.

Would it be possible for us to consider an alternative method of communication for information sharing and decision making around this specific topic?

@stefano and @Mer, it seems like it would be helpful for the wider community to move this forward quickly and transparently. Would this be in your team’s capacity to organize?

cheers,
amanda
OSS+AI Lead
Google Open Source

Hi @amcasari :wave: Yes, this is becoming a very long thread. @stefano is working on a post to synthesize and summarize the main points, which will hopefully allow us to reach greater clarity.

Thank you @Mer and @amcasari! Much appreciated. I think between this thread and discussions on other platforms (which as I recall, was encouraged as a valid way of providing feedback), there are a significant number of people with differing opinions. A more organized approach would be very helpful!

It’s also worth mentioning @mia’s excellent weekly summaries from the discussions happening in this forum:

https://opensource.org/blog/open-source-ai-definition-weekly-update-june-10

Be sure to subscribe to receive her summaries if you haven’t done so already.

Thanks everyone for the debate. I have tried to summarize the various proposals in a more manageable blog post. The discussion should move to the new thread.
