Open Source AI needs to require data to be viable

Yes, fair point on that, @moodler and @stefano. I hadn’t seen Downing’s piece, it is very useful indeed. I fear I must leave some of my idealistic notions behind in this context.

I am okay with BloomZ not counting as OSI-open on account of their using a license like RAIL that is not OSI-approved. I think my main concern, in line with the OP of this thread, is that meaningful forms of openness relate more to data availability and disclosure of training and source code than to the licensing of a single derivative component.

E.g. Mistral can merrily release their Mistral 7B or Mixtral model weights under Apache 2.0 (and benefit from open-washing) but if none of the upstream data and code is available, that is not a very meaningful form of openness, as it doesn’t allow users to exercise several of the fundamental freedoms (1 and 3). Model weights are the most inscrutable component of current generative AI, and providers that release only those should not get a free ‘openness’ pass. (In this I am in agreement with @juliaferraioli’s OP.)

In contrast, if a player like BigScience shares source code, data collection, curation, finetuning, and model architecture, releasing most of this under Apache 2.0 — then anyone with the compute (yes I know this will be very few parties) can actually train their own model and exercise all the freedoms. The choice of BigScience to release their own model weights under RAIL primarily affects freedom 0, but not necessarily the others. I recognize that this, too, may not be sufficient to qualify for OSI-openness. (The idealist in me would like to be able to count it as more open though, if only to incentivise model providers to move towards meaningful openness and away from open-washing.)

1 Like

Wait, why are you saying this? I’m probably missing an important nuance. In v0.0.8 of the definition, all required components must be available under an OSI-approved (or OSD-compliant) license. That rules out field of use restrictions.

Is your point that those requirements are only in the checklist and not in the definition per se?

To be clear: I fully agree with you that field of use restrictions should disqualify a system from being considered “open source AI”. I’m trying to understand why you say the current definition does not implement that properly.

Cheers

1 Like

I don’t understand how a gradient will discourage open washing rather than encouraging and facilitating it. By having a variety of meanings for “open source AI” you are giving even more surface area for creating deceptive statements around it.

1 Like

I don’t really follow, @pchestek , as we don’t propose “a variety of meanings” for the notion “open source AI”. Rather, in our paper, we observe that the current lack of clarity is the result of important dimensions of openness being overlooked (by regulators) or obfuscated (by model providers). I should note that our paper is more descriptive (aiming to make sense of the complexity in a domain) than prescriptive (proposing specific labels).

As the image upthread shows, we note how various possible operationalizations of openness each distort this complex reality in various ways, and some are more vulnerable to open-washing than others.

I agree that it may be desirable to have a strict definition of “open source AI” with a small attack surface. But we can also predict that model providers will try to wriggle their way out of this by insisting on watering down the notion (as in “fair trade”) or retreating to broader terms (as in “open”). The purpose of our paper is to sketch this broader landscape and propose dimensions of openness that can help pin down what providers are doing even if they are playing these word games.

Ah, thanks, I overlooked that link in the definition.

So it seems that the proposed definition is something like midway between Class II and Class I from the paper, due to more lax training data requirements? Maybe not? Class I feels like what I’d expect from “open source.”

I take some comfort in the “sufficiently detailed” data bit, but wonder about whether it’d actually be possible to “recreate a substantially equivalent system” without providing the data. And I’m uncertain about how to test for something like that.

For any AI system that’s deemed “open source,” an important test will be some organization actually doing this recreation to demonstrate that the promise is real.

That’s exactly what @pchestek is referring to: you’re admitting and even formalizing that there is a degree of “openness”, and by doing that you’re playing exactly into the hands of the open washers. Granted, there is no formal definition of Open Source AI yet, but we’re pretty close to having one. It’d be great to add one more column to your paper: after showing the availability of components, a final checkmark indicating whether each system passes or fails the “Open Source AI Definition.”

The OSI has been playing this game for 26 years: there are many, many organizations arguing daily that there is a degree of openness and that it’s all “open source”. Examples range from Microsoft’s Shared Source Initiative in the early 2000s to more recent companies and VCs sharing software under agreements that preclude unrestricted use and modification. It’s an old game that is being replicated in AI. Companies play the game because there is tremendous market value in being considered Open Source. And now, with the AI Act excluding “open source AI” from some obligations, it’s even more appetizing to win the open-washing game.

Open Source is a binary concept: you’re either granting end users and developers the right to have self-sovereignty over the technology or you’re not. Same with Open Source AI. And sure, there are AI systems out there that will be “almost” Open Source AI. But they’re not, and they all go in the same pile over there with a big FAIL stamp on it; there is no “oh but this is slightly better, more open because …” Nope, it goes in the same pile over there: NOT Open Source AI. That’s what the OSI has been entrusted to do by its stakeholders, and it will continue doing it.

1 Like

@jasonbrooks I moved your comment here because it’s more pertinent to this thread.

I’m not surprised you “feel” that way but AI is radically different and I don’t expect to see anybody running the test you suggest.

Nobody is rebuilding OLMo or Pythia from scratch just because they want to replicate the build before shipping it to their users (like Debian does for its software packages). It makes no sense to do so: retraining a system is not going to generate an identical system anyway, and it’s guaranteed to cost money and time without even generating academic points.

What we’re actually seeing are people finetuning trained models to build new systems/applications (like this one or this one) or re-training from scratch, but for very different reasons than software packagers/distributions have. Training is done to build new systems that improve the performance of existing ones: that’s useful. I can also see a reason to re-train to fix bugs and security issues, when the cost of mitigation is higher than the cost of retraining. But I don’t expect to see anybody retraining the way Debian rebuilds its software packages.

3 Likes

Stefano, I think the goal of OSI as sketched here by you is eminently reasonable and laudable. For legislative and regulatory purposes, we probably need that kind of clarity, and I will again note that I’m in agreement with the OP that the most viable version would be one that requires data.

One aim of empirical efforts like ours is to help make visible the amount of work that still needs to be done in order for model providers to clear that kind of bar. I don’t see how calling out open-washing and making visible the precise openness dimensions open-washers are trying to obfuscate is “playing into their hands”. I’m hoping we agree that Llama3, Mistral and the like, with their strategic lack of clarity about training and fine-tuning data, definitely don’t qualify as “open source” and are “at best open weights”, as we write in our paper. Sunlight is the best disinfectant. But then again I haven’t been at this for 26 years yet, and may be too optimistic. :smiley:

It’d be great to add one more column to your paper: after showing the availability of components, a final checkmark indicating whether each system passes or fails the “Open Source AI Definition.”

All our data is available and the kind of additional column you mention would be fairly easy to realize. It would be like method 4 in our Figure (where a small slice of models survives a dichotomous open/not open decision).

there is no “oh but this is slightly better, more open because …” Nope, it goes in the same pile over there: NOT Open Source AI

I can definitely see why this makes sense from the OSI point of view. Our interests are broadly aligned but don’t fully overlap. For instance, for academic research and educational purposes, very open models are useful and important to keep track of, even if they have restrictive licensing; which is another reason our index exists.

1 Like

Thanks @Mark I think we’re in agreement with the “open” debate. Let’s get back on topic:

Sorry if I sound like a broken record, but we need absolute precision; otherwise this becomes a debate about the gender of angels. What do you mean exactly by “requires data”? I don’t think that what the OP suggests is reflected in the criteria you used to assign green marks in your paper. Please correct me where I’m wrong.

Take Pythia-Chat-Base-7, a fine-tuned system you’ve reviewed, which is based on EleutherAI’s Pythia 7B, a model we’ve reviewed.

Pythia-Chat-Base-7 reveals that its training dataset is OpenDataHub. That repository shows that the data is shared as links to Google Drives (if I read the code correctly). If one is interested in retraining Pythia-Chat-Base-7 (why one would want to do that is not for me to ask at this moment, but I reserve the right to ask later), one would have to go to OpenDataHub and get the datasets. But there is also the chance that the data is not there anymore, as links go stale.
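To make the link-rot and versioning concern concrete, here is a minimal sketch (with hypothetical file names, not the actual OpenDataHub contents) of how a fine-tuning dataset snapshot could be pinned by checksum, so that a later rebuild can at least verify it fetched the same bytes. Of course, this only works if the data can still be fetched at all, which is exactly the point at issue.

```python
# Minimal sketch: pin a training dataset snapshot by checksum so a rebuild
# can verify it later. The file names below are hypothetical placeholders,
# not the actual OpenDataHub contents.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large shards don't need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(data_dir: Path, manifest_path: Path) -> None:
    """Record a checksum for every downloaded data file."""
    manifest = {p.name: sha256_of(p) for p in sorted(data_dir.glob("*.jsonl"))}
    manifest_path.write_text(json.dumps(manifest, indent=2))

def verify_manifest(data_dir: Path, manifest_path: Path) -> bool:
    """Later (or on someone else's copy), check the data still matches the pinned hashes."""
    manifest = json.loads(manifest_path.read_text())
    return all(sha256_of(data_dir / name) == digest for name, digest in manifest.items())

if __name__ == "__main__":
    data_dir = Path("downloaded_finetuning_data")  # wherever the Google Drive files were saved
    write_manifest(data_dir, Path("data_manifest.json"))
    print("verified:", verify_manifest(data_dir, Path("data_manifest.json")))
```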

If I understand @juliaferraioli’s argument (please chime in), this sort of disclosure doesn’t satisfy her requirements because links to Google Drives are not versioned, there is no easy way of forking the dataset, and there is no hash of the exact dataset used for the finetuning. Quoting from above:

And yet, in your paper the LLM Data box for Pythia-Chat-Base-7 is green.

A similar issue arises for OLMo 7B Instruct; the model card says:

For training data details, please see the Dolma, Tulu 2, and UltraFeedback documentation.

Which versions of Tulu 2 and UltraFeedback were used? That would fail Julia’s criteria, too… And that’s the top-scoring model.

Now, even more complicated: Pythia 7B, the ancestor of Pythia-Chat-Base-7, is built on the Pile dataset. Pythia-7B is one of the best documented models out there: code, data, papers, community… all open. The dataset it’s built on was public too, but the Pile was subject to a takedown request for alleged copyright violation and is not available anymore.

The Pile is no longer hosted or distributed by The-Eye or affiliates.
It is, however, still shared among the AI/ML community, presumably for its historical significance, and so can still be found via public torrents.

Hashes of the original dataset can be obtained from eleuther.ai at request.

Now what do we do? If I can imagine an Open Source AI, Pythia-7B must be one. It probably even matched @juliaferraioli’s high bar for a few months. But now, what’s Pythia-7B today?

And yet, all of these systems are at the top of the charts for transparency and reproducibility. If you look at the preliminary results of the validation phase, Pythia and OLMo both would be Open Source AI, while Falcon, Grok, Llama and Mistral would not, because they don’t share Data information.

Is the draft Open Source AI Definition working as intended? Can we please get more eyeballs to review systems?

2 Likes

In the past few days, I have been continuously contemplating the validity of Julia-san’s proposal. I can emotionally understand the need for high standards for datasets. However, Stefano-san’s counterargument is reasonable and aligns with OSI’s goals. We must draw a realistic line.

I finally understood the incident involving the Pile dataset, and I realized that almost the same thing could happen in Japan. Japanese copyright law considers training on the original Pile dataset to be legal, but the distribution of the dataset could be regulated by laws other than copyright law.

2 Likes

So many points here. First of all, if necessary source code is removed from a project (for whatever reason), that is a serious concern for the project, whether it is an AI system or an XML library.

I also find it interesting that you bring up specific AI systems and imply that “Data information” is the only reason they wouldn’t meet the standard.

Falcon 2 uses a modified Apache-2.0 license that adds use restrictions and a forced advertising clause. It is highly unlikely that this is an open source license.

Llama’s license has use restrictions and again, is highly unlikely to be an open source license.

Some Mistral implementations (notably 7B) are under Apache-2.0, but the corresponding source code is… missing. The mistralai/mistral-inference repository (“Official inference library for Mistral models”) is not the source code.

I would argue that OLMo should fail the definition whether we use the loophole “Data information” or require an open dataset, because the Allen Institute for AI (the upstream for OLMo) has released the training data for it (Dolma) under a license (ImpACT MR) that has significant use restrictions and does not meet any definition of open data that I am aware of. If Data Information can be interpreted to mean “data you cannot use” or “data you will never have access to”, how can it possibly be a dependency for an Open Source AI system?

If the subtext here is that “there are not enough good open datasets for training AI systems”, then let’s invest time and money in that instead of lowering the bar, which will have a chilling effect broader than AI.

2 Likes

I think we can all agree that Falcon, Llama, and Mistral do not meet the OSAID requirements, and this has been highlighted by the analysis from the Working Groups being led by Mer:

Now OLMo is an interesting case because they recently made a change to the Dolma dataset license and made a huge deal about it:

Dolma was under the ImpACT license and is now under ODC-By.

It’s likely that OLMo does meet the OSAID requirements.

3 Likes

@zack and @shujisado on field of use restrictions – individual components may be covered by their individual licenses (depending on what conformant/compliant wind up meaning), but the overall system may be subject to additional terms, which is why we need this to be explicit. See explanation in the parent post.

Nobody is rebuilding OLMo or Pythia from scratch just because they want to replicate the build before shipping it to their users (like Debian does for its software packages). It makes no sense to do so: retraining a system is not going to generate an identical system anyway, and it’s guaranteed to cost money and time without even generating academic points.

I vehemently disagree with your evaluation here, @stefano. Nobody may be doing it right now, but for broader adoption it is critical to have the capability to do so. Just because you don’t see a use here doesn’t mean there isn’t one. We’re talking about statistical models. Even if the retrained system is not 100% line-for-line identical, it should produce the same results from a statistical standpoint given the same parameters.
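To sketch what a test for this could look like (this is my own illustration, not anything prescribed by the draft definition), one could compare per-item benchmark scores of the released model and a retrained one with a paired bootstrap; the score arrays below are placeholders for the output of whatever evaluation harness would actually be run.

```python
# Minimal sketch of how "statistically equivalent results" could be made testable:
# a paired bootstrap over per-item benchmark scores from the released model and a
# retrained one. The score arrays are placeholders for the output of whatever
# evaluation harness you actually run (e.g. per-question 0/1 correctness).
import numpy as np

def bootstrap_mean_diff_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """95% CI for the mean difference in per-item scores (paired, same eval items)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    assert a.shape == b.shape, "both models must be scored on the same items"
    diffs = a - b
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Hypothetical per-item correctness for the released and the retrained model:
released = np.random.default_rng(1).integers(0, 2, size=500)
retrained = np.random.default_rng(2).integers(0, 2, size=500)
lo, hi = bootstrap_mean_diff_ci(released, retrained)
# If the interval sits inside a pre-agreed tolerance (say +/- 2 points), the retrain
# "reproduces" the system for practical purposes even though the weights differ.
print(f"mean accuracy difference CI: [{lo:.3f}, {hi:.3f}]")
```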

Is the draft Open Source AI Definition working as intended?

It is still not clear how it is intended to work. I hear conflicting information all over this thread. Until the criteria are concrete, exercises to evaluate systems are subject to extensive subjective interpretation, meaning their results cannot be normalized against each other.

Please correct me if I’m missing some overarching point here, but the counterargument against requiring data seems to be “it’s too hard”?

I echo @spotaws’s comment:

If the subtext here is that “there are not enough good open datasets for training AI systems”, then let’s invest time and money in that instead of lowering the bar, which will have a chilling effect broader than AI.

Great call-out, @nick, on the license change for Dolma. Given that, I do think that OLMo would qualify as open source based on the criteria that I posit are necessary. Pythia still would not, as the data are not licensed.

3 Likes

Dolma contains data that is not being used in a fashion consistent with its license, just like the Pile does. They didn’t go through C4 and validate the licensing of everything in it.

1 Like

The training data for the Pythia models can be found (fully preprocessed, including shuffling and tokenization) here.

2 Likes

@stefano writes:

What do you mean exactly by “requires data”? I don’t think that what the OP suggests is reflected in the criteria you used to assign green marks in your paper. Please correct me where I’m wrong.

You’re not wrong. As we note in our FAccT paper: “Our survey stays relatively superficial when it comes to assessing exactly how open the training data of a system is.” We mention multiple reasons; I’ll drop an image here in case you want to read them, but TL;DR: we realise the challenges, and our survey does indeed only scratch the surface when it comes to data availability.

[image: excerpt from the paper listing these reasons]

We’ve considered a more fine-grained system of openness dimensions, distinguishing between data disclosure/description, data availability, and data licensing, but our team is not large enough and it is too much of a side quest for us (we are language scientists with an interest in technology).

Here the prescriptive vs descriptive distinction comes up again. Our project is descriptive, and as scientists we’re not too bothered by restrictive/responsible licenses — our priority is to be able to understand and tinker with these systems, use them in teaching critical AI literacy, etc.

OTOH for regulation and for countering open-washing there is a need for more prescriptive clarity. That’s why we are very appreciative of the way these challenges are being tackled here and for instance in the Model Openness Framework. This is why I support @juliaferraioli’s point in OP and above that availability of data is crucial in the context of a strong and viable open source ecology.

I want to register my agreement with this — even if retraining isn’t common now, compute is rapidly scaling up and there are incentives to develop retraining capability. RedPajama is an actual example of an attempt to do so in order to replicate Llama (severely hampered of course by the tactical non-disclosure of dataset information by Meta).

I’ll add two other reasons that data availability is important: auditability and explainability, both of which often figure in discussions of the strengths and security of open source (even in ‘ancient’ times like 2001). E.g., important work auditing datasets by folks like Abeba Birhane, Vinay Prabhu and others is only possible when datasets are open. As for explainability, scientists can only truly understand and probe LLM behaviour if they can inspect the underlying data. The general public would be able to adjust their priors about how amazing it is that bar exams are being passed by some LLM if they knew what the training data was like.

Initiatives like What’s In My Big Data? are showing that such auditing and explanation is possible and feasible when datasets are open and available. They find that “several datasets used for benchmarking models trained on such corpora are contaminated with respect to important benchmarks, including the Winograd Schema Challenge and parts of GLUE and SuperGLUE” (!).
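As a small illustration of why availability (not just description) of data matters for this kind of audit, here is a toy sketch of an n-gram contamination check; the corpus shard and benchmark items are made-up placeholders, and real audits like What’s In My Big Data? operate at a vastly larger scale.

```python
# Minimal sketch of the kind of contamination audit that is only possible when the
# training data is actually available: flag benchmark items whose word n-grams also
# appear in the training corpus. Corpus and benchmark texts are toy placeholders.
from typing import Iterable, Set

def ngrams(text: str, n: int = 8) -> Set[tuple]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_corpus_index(corpus_docs: Iterable[str], n: int = 8) -> Set[tuple]:
    index = set()
    for doc in corpus_docs:
        index |= ngrams(doc, n)
    return index

def contaminated(benchmark_items: Iterable[str], corpus_index: Set[tuple], n: int = 8):
    """Return benchmark items sharing at least one n-gram with the training corpus."""
    return [item for item in benchmark_items if ngrams(item, n) & corpus_index]

# Hypothetical stand-ins for real corpus shards and benchmark questions:
corpus = ["the trophy would not fit in the brown suitcase because it was too big"]
benchmark = ["The trophy would not fit in the brown suitcase because it was too big.",
             "An entirely unrelated question about quantum chemistry and catalysts."]
index = build_corpus_index(corpus)
print(contaminated(benchmark, index))  # flags the first item as likely contaminated
```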

1 Like

OK, I think the positions are pretty clear and I’ll take some time to summarize this debate so that more people can chime in.

I think the example you give actually proves my point: RedPajama is not a rebuild of Llama to prove that you were given something like the “complete and corresponding source code” (as @juliaferraioli requires). It’s an attempt to build a new model, functionally similar to Llama, but with a permissive license. It’s the same exercise @stellaathena and others at EleutherAI did when OpenAI released their GPT.

I’d love to have a separate thread to hear what the incentives are to retrain models from scratch.

As a data point for this discussion, I’d like to point out the example of StarCoder2, which hasn’t been reviewed against the criteria of OSAID v0.0.8. To the best of my knowledge, it’s an LLM for code that might pass all the criteria of openness about its training (starting dataset included) suggested in this thread.

It would still not be OSAID-compliant, mostly because of the final model license (a RAIL variant, like the vast majority of models on HuggingFace). But that is, for me, a separate discussion from the data one. In particular: whether to apply an “ethical” license to the final weights is purely a policy decision, whereas liberating the entire training data and toolchain is also a technical one. StarCoder2, in my opinion, shows that it is technically feasible to do so.

1 Like

Hey everyone,
I’ve really enjoyed reading through this insightful thread, and I’d like to share my thoughts.

I’m leaning towards the data-information approach, which leads me to this key question:

What would be the preferred form to make modifications to the system?

Ideally, as a Data Scientist, I’d say having the dataset available is essential, because we need to understand the data before ingesting it into our models and to know how to preprocess it so that the model learns the behaviour we want. However, I recognize that the situation might differ for LLMs.
Arguments for data availability often cite explainability, security, and transparency. Yet these concerns aren’t necessarily resolved just by sharing the data; they tend to be addressed by implementing proper safeguards and other methodologies, as https://arxiv.org/pdf/2308.13387 discusses, noting that the datasets involved are extremely large and may contain copyrighted material. Another point is that studies confirm we can unlearn specific segments of an LLM without having access to the initial data: https://arxiv.org/pdf/2310.02238. More importantly, the LLM “world” is increasingly adopting techniques to augment or contextualize a model’s knowledge, like fine-tuning, RAG, RAFT, and many other approaches that benefit from having access to the model rather than to the initial dataset.

Given the architecture of LLMs (transformers), even a perfectly curated dataset does not guarantee the elimination of issues like hallucinations or undesired outcomes.

Therefore, I believe that sharing the dataset is not necessarily required and may not justify the potential risks associated with making it mandatory, as @stefano mentioned. Importantly, this approach wouldn’t jeopardize the open-source spirit or stop any future development. By utilizing alternative methods and maintaining transparency in how models are built and trained, we can uphold open-source principles without compromising data security or integrity.

Can having information about the initial dataset be useful? Absolutely! Do I need direct access to it? Ideally, yes, but in practice, it can be avoided.

In conclusion, in my opinion, having the entire dataset is not always the preferred form for making modifications to the system; what I do think is mandatory to share is the architecture and the weights! Study and use can often be accomplished through indirect methods like model introspection and interpretability tools. Modifying and sharing outputs can be effectively managed with advanced techniques such as fine-tuning and Retrieval-Augmented Generation (RAG) without the need for the original dataset (we still need the data information to see whether we need to “inject” external data the way RAG or fine-tuning methods do).
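To make the RAG point above concrete, here is a minimal sketch of retrieval plus prompt assembly that adapts a model’s behaviour with no access to its original training data. The documents, the question, and the choice of TF-IDF retrieval are illustrative assumptions on my part; the final generation step is left to whichever open-weights model you run locally.

```python
# Minimal RAG sketch: behaviour can be adapted by injecting external documents at
# inference time, with no access to the model's original training data. The documents
# and question are placeholders; the generated prompt would be sent to any locally
# hosted open-weights model of your choice.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Our internal policy requires all deployments to be reviewed by the security team.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
    "Model releases must include a data sheet describing the training corpus.",
]
question = "What must accompany a model release?"

# Retrieve the most relevant document with plain TF-IDF similarity.
vectorizer = TfidfVectorizer().fit(documents + [question])
doc_vecs = vectorizer.transform(documents)
q_vec = vectorizer.transform([question])
best = cosine_similarity(q_vec, doc_vecs).argmax()

# Assemble a grounded prompt; this string is what the LLM would receive.
prompt = (
    "Answer using only the context below.\n"
    f"Context: {documents[best]}\n"
    f"Question: {question}\nAnswer:"
)
print(prompt)
```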

Ultimately, while the dataset can enhance these freedoms, especially for transparency and reproducibility, the evolving landscape of machine learning shows that we can often achieve the same goals through alternative methods without compromising security or intellectual property. Thus, in my opinion, it’s feasible to keep the open-source spirit without needing the complete initial dataset, particularly for Large Language Models (LLMs).

2 Likes