Open Source AI needs to require data to be viable

Replicating the data section from this post:

For an Open Source AI Definition to achieve its goal of modifiability, an AI system must include the data used to train the system. We are aware of the challenges that this poses for the definition, but the very Model Openness Framework the current definition references states that full transparency and reproducibility require the release of all datasets used to train, validate, test, and benchmark. For AI systems, data is the equivalent of source code, and we explicitly require that source code be obtainable for software to qualify as open source. The current definition marks these components as "optional".

Where inclusion of datasets poses a privacy or legal risk, we suggest the use of equivalent synthetic data to meet this requirement, provided the synthetic data achieves comparable results when training the model.

The required components in the Data Information section are not sufficient for someone to modify an AI system as defined. (Note: modification means changing the system before a model is trained, and is therefore more in-depth than fine-tuning, transfer learning, or similar techniques.) Inclusion of data sets is listed as optional, which means that the section might as well be elided. In fact, there is no requirement that the data used to train an Open Source AI system be licensed under an open license at all, unless the maintainer plans to publish that data.

In this, the OSAID fails to meet the necessary high bar to ensure a practical and inclusive standard for Open Source AI. Practically, the OSAID is worded this way so that AI systems can be considered "Open Source AI" without having to publish the dependent data.

We put forth the call to require the inclusion of original training data sets in order for a system to be called "Open Source AI". When that's not possible for the reasons outlined above, the alternative of synthetic data should be provided alongside a justification for not releasing the original data sets.


This is a very important topic, as confirmed by its recurring nature. Thanks @juliaferraioli and the AWS Open Source team for bringing this up.

To limit the risk of rehashing arguments already discussed and to help the discussion flow towards a conclusion, I highlight a few things that I'd like people to keep in mind as they contribute to this thread.

1. Looking for the source of AI leads to confusion

Specifically, the ML systems we're focusing on have a very different structure than the source/binary duo we're used to. I highly recommend abandoning the mental map that makes you look for the source of AI (or ML), as that map has been driving us in circles. We have agreed instead that we're looking for the "preferred form to make modifications to the system", as in: "What components of ML systems are required to change the output given an input?"

2. The law treats data differently than code

It's pretty clear that the law in most jurisdictions around the world makes it illegal to distribute data, because of copyright, privacy and other laws. It's not as clear how the law treats datasets, i.e. the difference between distributing the Harry Potter books in epub format and a tokenized version of the saga (unreadable by humans). Laws around the world are still being written and court cases in the US are developing: we won't know for sure for a long time.
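To make that raw-data/dataset distinction concrete, here is a minimal sketch of what tokenization does (tiktoken is used purely as an example encoder; any tokenizer illustrates the same point):

```python
# Minimal sketch: a tokenized dataset is a stream of integers, not prose,
# yet the original text remains fully recoverable. tiktoken is just one
# example encoder; the same holds for any tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "The boy who lived had never even seen a wand."
tokens = enc.encode(text)

print(tokens)                      # a list of integers, unreadable by humans
print(enc.decode(tokens) == text)  # True: the work is still fully encoded
```

The fact that the integers decode back to the exact text is precisely what makes the legal status of such datasets so unclear.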

A related question, what the law should say about the tokenized versions of the data (the datasets), is such a crucial topic that OSI is financing a workshop to start framing this issue more clearly. If you're interested in this topic, a good primer is Open Future's paper Towards a Books data commons. Eventually we will have to form an opinion and educate regulators. This will take time and can go on a separate thread.

3. Re-read the definition of Data information in draft 0.0.8

The text of draft 0.0.8 is vague on purpose, to resist the test of time and technology changes. But it's not supposed to be vague about its intention: to allow AI/ML practitioners to retrain an Open Source AI model and improve it. Draft 0.0.8 says:

Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.

The words come after the workgroups' analysis. They're also based on the fact that the most open ML systems (like Pythia) cannot distribute the dataset they were trained on, and most likely there will not be legal clarity for many years. Maybe there will never be clarity once global legislation is factored in (for example, distributing The Pile may be legal in Japan but not in the US).

4. Provide clear examples

When you criticize the draft, provide specific examples in your question and avoid arguing in the abstract. For example: what if the data information discloses that the model has been trained with Common Corpus, StarCoder and the Reddit data? Would that requirement be considered satisfied? Another example: what if the data information discloses the use of StarCoder v1, but a court order forces StarCoder to remove v1 from public access and only v2 can be used? Will the model that was Open Source AI lose the qualification?

5. Join the townhall meetings

This topic is hot and we need to discuss this issue more. Join the townhalls to ask for clarifications. The next one is on May 31st: https://opensource.org/events/open-source-ai-definition-town-hall-may-31

Questions such as these are indeed helpful for thinking through the requirements, but the draft as it stands may, paradoxically, end up damaging the incentives for openness!

I go back to my earlier example: a model such as BLOOM or StarCoder is not viewed favorably from the perspective of Open Source because it considers the potential harms and includes usage restrictions, despite being significantly more transparent, reproducible and thus "open" than the Mistral models.


@Danish_Contractor You are missing the fact that there are (at least) two angles to "open source": the right to use, modify and distribute the subject matter (in software, this is manifested as the grant of a copyright license that gives permission for these actions), and also the ability to use the subject matter for any purpose, without restrictions on how anyone may use it. These are OSD 5 and 6. You seem to be advocating for "open source" only in the first sense but not the second. The second is equally important.


No, I understand that. I'm trying to see if we could incentivize more sharing via the Open Source AI draft; currently this draft effectively just says: remove the usage restrictions from an OpenRAIL-MS license and we have Open Source AI.

And let's say we do that: if I wanted to modify a given "Open Source AI" system as defined by the draft, in a way that excludes (for example) all Wikipedia content (assuming that was part of the training data), I can't make that modification.
Is that an acceptable constraint on openness in Open Source AI?

There's nothing wrong with open sourcing a component of an AI system (weights in this case) as opposed to the full system, but then maybe we should make that clearer in the language.

I had made a suggestion here in this thread: Recognising Open Source "Components" of an AI System


I'm not following what you are saying. Are you saying that someone who benefited from having access to the full training set, and chooses not to use all of it for their own subsequent model training, won't have open source AI from having eliminated some of the training set? Or are you saying that if the original source of the dataset says "here it is, excluding Wikipedia," that original source isn't in compliance?

An oversimplified example, but let's say there are two AI systems, 1 and 2.

AI system number 1 shares that it was:
"Trained on English language data from Wikipedia" and it releases its weights under an OSI-compliant license.

AI system number 2 says:
"Here are the URLs of articles from Wikipedia we trained on" and releases its weights under an OSI-compliant license.

Two years after both models are released, we realize that some of the facts returned by the models are incorrect, especially on things like politics and law, and we want to address that because the information is outdated. What do we do?

AI system number 1: We have no option but to give it updated training data and hope that the updated training makes it forget the old information (we can test on some things but can't think of every fact). We will effectively train with new data without having any means to compare against how it was trained originally.

AI system number 2: We just look at the articles from the training data that have been updated in the last two years, remove or replace the ones we want, and train the model again. This way we make an easy "modification" to the model by changing the things we care about and leaving everything else as it was originally. We will effectively train with new data again, but this time knowing exactly how the model was trained originally.
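As a rough sketch of what system number 2 makes possible (hypothetical, and assuming the published URL list also records when each article was fetched; the manifest file name and helper names below are illustrative only):

```python
# Hypothetical sketch of the targeted update enabled by AI system number 2.
# Assumes the published manifest pairs each URL with a timezone-aware
# ISO 8601 retrieval timestamp; nothing here comes from a real system.
import json
from datetime import datetime, timezone

def load_manifest(path):
    # The disclosed training data: [{"url": ..., "retrieved_at": ...}, ...]
    with open(path) as f:
        return json.load(f)

def stale_entries(manifest, cutoff):
    # Candidates for re-crawling: articles fetched before the cutoff may
    # have been updated since; everything else stays exactly as it was.
    return [rec for rec in manifest
            if datetime.fromisoformat(rec["retrieved_at"]) < cutoff]

manifest = load_manifest("training_manifest.json")
cutoff = datetime(2022, 6, 1, tzinfo=timezone.utc)
to_refresh = stale_entries(manifest, cutoff)
# Re-fetch only these URLs, swap them into the corpus, and retrain,
# knowing exactly how the new training set differs from the original.
```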

Should there be no difference between the two models from an open source perspective? Are both systems to be considered open source AI systems even if one is missing a critical component (the data)?

That said, releasing something is better than releasing nothing, and even in traditional software we have cases where parts of a closed software system are open sourced; but the parallels end there. The difference here is that when someone open sourced a component in traditional software, you didn't have to do any guesswork or approximation to make modifications. Should an open source AI system not be held to the same standard?

That brings me back to the motivation behind my proposal that we recognize the individual components of an AI system being open sourced, and additionally recognize a full AI system as open source if and only if all of its components comply with that definition.

Taking inspiration from the way Open RAIL licenses are named:
AI system no. 1 = an Open Source-MS licensed model
AI system no. 2 = Open Source-DMS (or Open Source AI), because all components are open sourced.

Doing so not only (hopefully) incentivizes developers to achieve full openness, but also makes it clear when things are partially released (and again, releasing something is always better than releasing nothing).


I came across this page today in the Whisper repository:

https://github.com/openai/whisper/tree/main/data

It might be a nice example to discuss… is it sufficient or not to qualify this model as Open Source AI?

(In my opinion, FWIW: yes, it's enough, assuming it's accurate and complete. To me it feels similar to how experimental science publication works: researchers don't usually release full datasets, but they do need to provide enough information for others to reproduce the experiment.)


They seem identical from the transparency perspective to me; I don't understand why you consider one of them to be missing data.

Consider that draft v0.0.8 calls not only for the list of URLs of Wikipedia articles but also for the code used to create the dataset: tokenize it, label, filter, dedupe… Your example seems too vague for me to give a definitive answer.
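For illustration only (the helper and thresholds below are hypothetical, not anything draft v0.0.8 prescribes), the filter/dedupe part of such dataset-creation code can be as small as this, and releasing it removes the guesswork:

```python
# Hypothetical sketch of publishable dataset-creation code: filter and dedupe.
# The length threshold is an arbitrary example; the point is that shipping
# this code documents exactly how the training corpus was built.
import hashlib

def clean_corpus(documents, min_chars=200):
    """Drop short fragments and exact duplicates, preserving order."""
    seen = set()
    cleaned = []
    for doc in documents:
        text = doc.strip()
        if len(text) < min_chars:
            continue  # filter: discard documents below the length threshold
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # dedupe: skip exact duplicates by content hash
        seen.add(digest)
        cleaned.append(text)
    return cleaned
```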

We need to be specific, like @moodler just did: point at a repository and let's evaluate.

I am liking this idea of having two or three clearly defined levels of openness; it might be practical for the reality of very large and evolving data sets…


There are a few relevant efforts that support a degree of openness for AI. For example, the Linux Foundation is likely going to maintain the Model Openness Framework with its levels of openness. Nextcloud talks about an ethical AI rating. There are more.

The Open Source AI Definition must be binary, just like the Open Source Definition for software.

Thanks very much for the links, and for your patience with the many of us who are dropping into all this very much part-time in the middle of 1000 other things. I am very curious about this red line and how it was drawn. I'm guessing the thinking is to provide a much-needed rock in the shifting sands?

We've been following the discussion with great interest. Stefano mentions "a few relevant efforts that support a degree of openness". Ours is among them (see opening-up-chatgpt.github.io) and we've supported the argument in multiple peer-reviewed papers, as well as in a contribution last October to one of the OSI deep dives.

I want to share a new FAccT '24 paper of ours in which we go into these matters in some detail. I don't want to hijack this thread, but since Stefano pointed me to it and invited me to share it here, I take the liberty of providing a quick pointwise summary:

  1. We diagnose open-washing as a key threat from Big AI that hurts the open source ecology.
  2. We point out a possible loophole in the EU AI Act that provides "open source" models with exemptions (and thereby gives Big AI a strong incentive to open-wash).
  3. We point out the risk of relying exclusively on a (binary) license for open source decisions.
  4. We argue for a gradient and composite notion of openness to disincentivize open-washing and other forms of audit-washing.
  5. We draw attention to BloomZ's RAIL license and propose that this should count towards openness too, or at least should not necessarily detract from it (in line with what @Danish_Contractor says above).
  6. We argue that datasets represent the area that most lags behind in openness. A definition that is vague on this risks being toothless and vulnerable to open-washing. (I agree with @juliaferraioli's OP concerning this.)

We also question the wisdom of focusing on a binary definition. As we point out:

A licence and its definition forms a single pressure point that will be targeted by corporate lobbies and big companies. The way to subvert this risk is to use a composite and gradient approach to openness.

Hopefully our evidence-based approach, survey of >45 'open' LLMs, and documentation of actual ways in which companies are already open-washing provide some useful empirical background to the discussions here.

Liesenfeld, Andreas, and Mark Dingemanse. 2024. 'Rethinking Open Source Generative AI: Open-Washing and the EU AI Act'. In The 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT '24). Rio de Janeiro, Brazil: ACM. (public PDF in our institutional repository)


@moodler writes:

I am liking this idea of having two or three clearly defined levels of openness; it might be practical for the reality of very large and evolving data sets…

Specifically in response to this, let me point out that in the aforementioned paper we work out a few non-mutually-exclusive ways of turning fine-grained data on openness into more categorical judgements (and we point out the risks of relying on the most reductive approach of all, a hard binary judgement).

Can you elaborate on this one?

In particular, are you suggesting that it should be fine for required parts of AI systems to be released under RAIL license(s), in spite of those licenses containing field of use restrictions? (Which has traditionally been considered not OK by the open source and free software movements.)

I'm asking because the way I read previous comments on this point was exactly the opposite, i.e., that we should encourage authors of "open" AI models to drop those restrictions and go for OSI-approved licenses on those parts, rather than relaxing the draft definition even further so that RAIL-style restrictions become acceptable.

Thanks a lot for your work on this, and for your paper!

Cheers


I guess I am not an openness hardliner when it comes to licensing: I value transparency and accountability over some of the freedoms in some cases (what we call 'meaningful openness' over 'radical openness' in the paper, contrasted with the 'homeopathic openness' of the open-weights type). E.g., I think providers and users of LLMs should not be free to create oil spills in our information landscape, and I think RAIL provides useful guardrails for that.

By 'not detract from openness' I mean that having components like the model weights under a RAIL license, and other components, like data, well documented and shared under Apache 2.0, as BloomZ does, strikes me as a more desirable situation than the reverse, of which e.g. Llama is a good example.

I do recognize this issue is partly orthogonal to the question of an OSI license and what it is meant to accomplish.

I wanted to let this discussion evolve before jumping back in, and I'm really happy to see such nuanced perspectives.

@stefano mentions

The law treats data differently than code

Yes, it does. And that's okay. We aren't constructing a legal definition here. We aren't creating new licenses. We're creating a definition of what it means, in spirit, to be open source AI. Looking at it from the legal perspective and working backwards from that is not productive. It is artificially limiting, and creates a brittle definition, as the law varies country by country and evolves over time.

He goes on to say that

The text of draft 0.0.8 is vague on purpose, to resist the test of time and technology changes.

Resisting time and technology changes is good, but a draft as vague as the current one is nearly meaningless. It relies too much on interpretation and judgement calls. My definition of a "skilled person" is different from another's. My understanding of a "substantially equivalent system" may be wildly different from someone else's. The same goes for "the same or similar data".

If the "same or similar data" is not freely available and accessible to an arbitrary party, does it fail the test? Or is this an intended loophole?

Examples

@stefano, you requested that I create this thread with this text. I am providing clear feedback on the definition as drafted. Clear examples are difficult, given the intentionally vague language in the draft.

However, I think that Spot's example of the Reddit data set is a good one. You can give all the details of the Reddit data set that your AI system was trained upon, but if everyone has to fork over $$$ to license the data set in order to modify it, then it's not really open source, is it?

@Danish_Contractor posed two scenarios:

AI system number 1 shares that it was:
"Trained on English language data from Wikipedia" and it releases its weights under an OSI-compliant license.

AI system number 2 says:
"Here are the URLs of articles from Wikipedia we trained on" and releases its weights under an OSI-compliant license.

These are missing extensive information, like the timestamp at which the data were fetched. Those timestamps, and therefore versioned data, are critical to being able to train the system. Without the versioned data, I wouldn't consider it open. There's a reason that when citing a webpage in a paper, you include the "retrieved at" date.
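To illustrate what I mean (a hypothetical record format, not something from the draft), a reproducible disclosure would pin each document down to the version the model actually saw:

```python
# Hypothetical example of a versioned training-data record. The URL alone is
# not enough: the "retrieved at" timestamp (plus a content hash) pins down
# which revision of the page went into training.
from dataclasses import dataclass

@dataclass
class TrainingRecord:
    url: str            # where the document came from
    retrieved_at: str   # ISO 8601 timestamp of the fetch
    sha256: str         # hash of the exact bytes used in training

record = TrainingRecord(
    url="https://en.wikipedia.org/wiki/Open_source",
    retrieved_at="2022-05-01T12:00:00Z",
    sha256="...",  # hash of the fetched content, elided here
)
```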

@moodler referenced the Whisper model:

https://github.com/openai/whisper/tree/main/data

It might be a nice example to discuss… is it sufficient or not to qualify this model as Open Source AI?

I think this is a good step. I haven't gone through the work of reproducing it, but assuming it is complete, then yes, I believe it would (almost) qualify. Again, I would want "retrieved at" dates. It isn't as good as including the raw data directly, but that is a high bar.

Levels of openness

From an ideological standpoint, I like the levels of openness concept. From a practical standpoint, I think it would be too complex to administer, understand, and consume. In this, I agree with @stefano.

At times like these, I do wonder if we are over-indexing on a definition. If we abstract away the ins and outs, we can consider this like complex software with multiple dependencies. If one dependency/aspect of that software is not open source, then the resulting piece of complex software is not open source.

Like-to-like is always easier to evaluate, but at the end of the day we're dealing with the same issues. Do all (present and necessary) components need to be open for the whole to be open (source)? Yes.

If they arenā€™t, then itā€™s not open source.

Field of use restrictions

It is my opinion that the definition (if there is one definition) needs to include prohibitions against field of use restrictions, like the OSD has. It currently does not.


You have to realize that this is wishful thinking, though. In a detailed essay, Kate Downing (a renowned lawyer) wrote about why these licenses, despite their good intentions, are not going to work.


Currently, the OSAID specifies four freedoms as requirements for being considered open source AI. These are likely based on the FSF's "four freedoms." As I understand it, the meaning of the word "freedom" is "not restricted," and the OSD and FSF's four freedoms essentially mean the same thing.

The OSD consists of 10 clauses because they serve as criteria for determining whether a software license is open source. For OSAID, the criteria seem to be entrusted to the checklist. Therefore, I don't find it particularly problematic to explicitly state the four freedoms as principles first. I believe that the principle of "No Discrimination Against Fields of Endeavor" in OSD clause 6 is likely addressed on the checklist side.

However, it might be a good idea to add a clause suggesting that the term "freedoms" encompasses the principles of OSD clauses 5 through 10.


I agree 100% on this. Usage is a huge minefield to attempt to define and enforce through copyright leverage. I know this from the experience of seeing how my own software (Moodle) has been used in ways I personally was unhappy with. My experience includes trying to get court judges to understand. "Open means Open" is where we always end up.
