Training data access

I’ve mentioned this multiple times on the private mailing list. I’m bringing it up here again because this is one of the biggest elephant-in-the-room issues.

I suggest that at least “open access” training data is required to fulfill the “freedom to study and modify”.
Requiring datasets to be open-source under OSI-approved licenses is not quite practical given the current practice.

I understand that people may want to make the requirements more permissive, not requiring an open source AI to allow people to study its training data, in order to make the definition more compatible with the current practice.

I’m strongly against being too permissive on training data access. While the open source AI definition is orthogonal to ethics, bias, etc., access to an AI’s original training dataset is a vital basis for solving those kinds of issues. Consider an open-source large language model trained by company A, where its training datasets contain deliberate badmouthing of company B.

How do you investigate and trust such a badmouthing language model without being able to peek into its original training dataset?
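
To make the question concrete, here is a rough Python sketch of the kind of check that is only possible with access to the training data. The corpus, the word list and the function name are all hypothetical, and a real audit would need far more than keyword matching:

```python
# Hypothetical stand-in for a training corpus; a real audit would stream
# the actual dataset instead of a hard-coded list.
corpus = [
    "Company B makes reliable products.",
    "Company B is a scam and its products are terrible.",
    "Company A released a new model.",
    "Never trust Company B, they are terrible.",
]

# Assumed list of negative words, chosen only for this illustration.
NEGATIVE_WORDS = {"scam", "terrible", "never"}


def flag_negative_mentions(documents, target, negative_words):
    """Return the documents that mention `target` alongside a negative word."""
    flagged = []
    for doc in documents:
        lowered = doc.lower()
        if target.lower() not in lowered:
            continue
        # Strip basic punctuation so "terrible." matches "terrible".
        words = {w.strip(".,!?") for w in lowered.split()}
        if words & negative_words:
            flagged.append(doc)
    return flagged


flagged = flag_negative_mentions(corpus, "Company B", NEGATIVE_WORDS)
print(len(flagged), "of", len(corpus), "documents flagged")
```

Without the dataset, there is nothing to run a check like this on; one can only probe the model’s outputs.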

This is a friendly warning to the discussion group about the risk of being too permissive on training data.

I missed recent online meetings due to a busy personal schedule. If this issue has already been discussed, I’m sorry for the noise. If it is not resolved, this is my vote.

I really agree about the required access to training data. Not just for control and validation.

What makes open source software popular is the access to the source code. The author is motivated to share by external contributions; the user is motivated to adopt by improved control and the ability to integrate and customize. A binary black-box trained model doesn’t provide any benefit to either of them: the author bears all the costs of training (collection, annotation, compute…) and cannot receive contributions (as the training data are opaque to others), while the user can only use it as is, with no modifications.

A binary trained model with no training data should be considered “open source” no more and no less than any proprietary freeware application.

I’ve detailed this position in this blog post. It is in Italian only; anyone interested can probably translate it on the fly with an online translation service.

Replying quickly to both of you because I deeply sympathize with where you’re coming from: access to data is necessary to understand and modify how many AI systems work. That’s a large boulder and to move the conversation forward, we need to

The next question is:

  • What about AI systems where data is not available, like those trained with federated learning or other privacy-preserving data-sharing techniques?

and more:

  • What does “access to data” mean exactly and why is it necessary?

For example, the question of data is starting to emerge from the ongoing exercise to review specific systems. See “Should testing data be made available to *use* an AI system?”

Obviously, I completely agree that for an AI model to be considered open source, the training data must be open. Think of critical applications where a certain number of checks need to be made on the data in order to be able to give it the right level of confidence.

For the models we are currently developing, we consider the model to be a derivative work of the data, which therefore of course has to be open. What’s more, this means that the licence governing the model depends on the licence protecting the data, particularly if the data is for non-commercial use, which implies the same thing for the model trained with this data. I’m aware that this constraint is strong, but it seems to me that it respects the intentions of the authors of the data.

I don’t have a settled view on whether access to training data (whatever access might mean) should be seen as necessary for a model to be “open” (though I’m actually skeptical about this). But I want to make one point about practicality: If you insist on “open” training data, at least if “open” means approximately what those of us in the FOSS universe understand it to mean now, then as a practical matter there will be few, if any, open source models other than perhaps some toy examples or research models of limited practical interest and utility. That’s because I believe that for the kinds of models attracting current user and policy interest (e.g., LLMs), it is not possible to train a performant model from scratch entirely from data items known to be under libre terms. Am I misinformed about this?

Even if you relax the historical definition of “open” to embrace, say, licenses that prohibit commercial use, that won’t solve the problem at all. And I hope the OSI is not considering departing from the bedrock principle that “open” cannot mean “commercial use prohibited”.

The insistence on “open” training data also ignores the role played by doctrines like fair use in the US. Open source in the traditional sense has always relied in part on the existence of limits to copyright. Fair use might permit much of the training on copyrighted data items happening today, but such data in general won’t be susceptible to availability under “open” terms. Why shouldn’t open source AI be able to take advantage of fair use and other limits on copyright?

Just echoing here my position from previous discussions.

I consider that full access to the original training data (as well as the training code, although this is off-topic for this thread) is a requirement to exercise the freedom to study an “AI system” based on machine learning. Short of that access, it is not possible to fully implement relevant use cases for transparency, such as analyzing biases efficiently.

Note that to fulfill this requirement alone, one does not need the ability to redistribute the training dataset, just to access it. However, some common current practices (e.g., signing TOS-style agreements before accessing the data) remain problematic and we should probably consider that such data gatekeeping is not acceptable.

I do think that the ability to redistribute the original training dataset should be a requirement for the freedom to modify an ML-based AI system, because short of that the amount of modifications one can do to such a system is much more limited.

Cheers

OLMo (Open Language Model) was released today as truly open (their emphasis), and it includes:

  • Full pretraining data: The model is built on AI2’s Dolma dataset, which features a three-trillion-token open corpus for language model pretraining, including code that produces the training data.
  • Training code and model weights: The OLMo framework includes full model weights for four model variants at the 7B scale, each trained to at least 2T tokens. Inference code, training metrics and training logs are all provided.
  • Evaluation: We’ve released the evaluation suite used in development, complete with 500+ checkpoints per model, from every 1000 steps during the training process and evaluation code under the umbrella of the Catwalk project.

Might be worth studying this one as well.

+1
It seems they did interesting work around the license definition: AI2 ImpACT Licenses
Worth studying

Personally, I quite strongly disagree. I’ll post here what I said in my comment to the latest draft (0.0.5).

I think data being available is really important, but considering it as a requirement for a system to qualify as “open source” would be unwise.

First, the availability of anything under proprietary licenses shouldn’t make something closer to an “open source” status.
The logical consequence might seem to require all data to be open source, but that would mean plenty of systems which are already commonly referred to as “open source” wouldn’t be.

Furthermore, the training data and the trained model are two different, separate assets and the model does not necessarily contain much information which is specific to the training data, as the learned features can be much more general.

Whether the model is open source, therefore, should be orthogonal to whether the training data is open source, which I think is also consistent with the OSD.

Instead of requiring training data to be available for a model to qualify as “open source”, a different, higher standard may be needed, to describe a model which is open source AND trained on fully open source data AND well documented.

(I am not a lawyer and this is not legal advice).

But data available under proprietary terms may do very little for user freedom, especially depending on changing laws.

Users who aren’t allowed, themselves, to train on that data, don’t gain anything in terms of their ability to modify the model and, really, even studying it. And if they can’t themselves share the data, they can’t share the model itself as open source either (if we consider it to be a requirement), contradicting the freedom to share.

Even if the user and the model author are under the same jurisdiction, some laws may also only allow certain parties to use certain data, or only allow non-commercial use of certain data. So the fact that the model author was allowed to train on that data doesn’t mean the user will be.

Aren’t those legal questions still quite openly discussed? Indeed, I don’t see how that can be the case whenever the model doesn’t contain enough information which is specific to the training data.

I know it’s not the only interpretation. I also think it’s the least desirable one; on this, see this article by Felix Reda.

I think a trained ML model and binary code are quite different.

In the case of a computer program, the source code and the binary code are two representations of the same asset. The program is made by writing human-readable source code, which is always preserved in practice. Sharing a program only as a compiled binary is done deliberately to restrict user freedom. The code is a description of the program in text and its availability makes all the difference in one’s ability to study and modify it.

A trained model and its training data are two different assets. They are not the same thing in two different forms. Sharing data can be extremely demanding and costly and it can also raise legal questions. Trained models are often, by their very nature, hard to study and understand. Data helps, but is neither as necessary nor as sufficient as source code is for a program. Not even close.

Also, one crucial aspect of open source is security. A compiled program can perform operations the user doesn’t want. But a trained AI model is just an equation and the user can know exactly what operations (both computer operations and mathematical ones) occur when the model is being used, at every level of the stack.

Please consider the exception to copyright laws in EU countries provided by Directive (EU) 2019/790 at article 4. Unlike article 3, which only applies to certain organizations, article 4 applies more broadly, but only allows data retention for as long as needed for the purpose of mining.

In essence, if there is a requirement to make data available for a system to qualify as “open source”, it seems to me that perfectly legal AI systems couldn’t both take advantage of this crucial exception to copyright and qualify as “open source”, even if the license of the model is open source. This seems extremely restrictive to me.

In addition to everything else, when datasets are huge, how is the author even supposed to provide access? The model may be significantly smaller than the data and, while it’s trivial to always carry source code along with binary code, providing training data every time a model is included in bigger programs can be quite hard.

I think @Aspie96 hits the issue very well

also great:

But I had more conversations on the data issue around the FOSDEM weekend and I had a sudden revelation after one conversation with an ML creator. As we chatted, he clarified what he meant by saying “we need the data”: for his projects, he simply doesn’t have good data. Assembling a good dataset is tedious, time-consuming, expensive and legally uncertain. Distributing a good dataset is even more complicated.

As I asked more questions, he clarified that he doesn’t need the original dataset to modify Llama2 or Whisper. He just wishes he had more large, high-quality datasets to train and finetune his own systems, built on top of other foundation models.

This specific conversation left me wondering if we’re asking the wrong question. Given the legal status of data, I can’t see how we’re going to make the original dataset a dependency of a trained model and also have any meaningful definition of Open Source AI.

Maybe what AI creators and researchers are telling us when they ask for the data is that we should really have a way to force or incentivize companies to release their datasets, independently from the models?

Thank you, Stefano.

There is something I’d like to add.

“Source code”, in the context of FLOSS, is typically defined as the preferred form of an asset for making modifications to it.
This doesn’t mean making modifications will be easy, just that the source is the best representation for modifications among those that do exist. It’s just a form of that asset, it doesn’t include other assets.

Training data is more akin to documentation than source code (although not truly analogous to that either). And while good documentation is really important, withholding documentation doesn’t make a program non-FLOSS. Indeed, documentation can be crucial for studying and modifying a really complex program, yet no one argues an undocumented program can’t be open source.

If an asset is available under an open license in the form of “source code” it’s always been defined as “free”/“open”, both in the context of software and more broadly for other kinds of works. I feel breaking from this tradition in the field of AI would create an inconsistency we do not want.

In the case of an ML model, a representation of the model itself in a format which can be read using open source software and doesn’t carry purposeful technical restrictions is “source code”, regardless of how hard modifying or reproducing the model is.

That said, and I’ll repeat myself here, data being available is really important.
In fact, personally, I hold the stance that data should be in the public domain, not just under an open license.

I’m going off topic with this, but I think a higher, more demanding standard could be that:

  • The model is open source.
  • All training data is provided.
  • The code necessary to run, reproduce (by retraining from scratch) and modify (such as through fine-tuning) the model is provided.
  • Everything is sufficiently documented.
  • All assets other than the model (including documentation) are also under open licenses and all licenses (including that of the model) are compatible with each other.

When it comes to data, the legal status of a dataset (as a collection of independent works) and individual entries in the dataset are independent. Both should come under compatible open licenses and/or, ideally, be in the public domain.

I don’t know what the name of such a standard should be, but it shouldn’t be “open source”.

I’m not sure that’s what they mean. I think often people do want the data that was used to train a specific model to be available. I also think what we should force companies to do is ultimately a political question, regardless of what shared definitions we use.

I should also mention that, of course, not all of AI is machine learning. We also have rule-based symbolic systems. Those can be hardcoded by hand and the author may very well “train” them by using some data. Must that data be available for the program to be open source? This requirement would never apply to any non-AI program which the author geared to work well with a bunch of available assets.

Even among ML models, many may have little to do with any specific dataset they were trained on, if they end up representing a domain well enough. Think of an image denoiser or a POS-tagger.

As a perfectly legal software application can be distributed even if it is not “open source”…

In 1998, the idea of distributing source code along with the software (together with all the rights to reuse and modify it!) was considered totally crazy. Today, it is an industry standard. And the term “open source” is synonymous with “trust” exactly because it requires conditions that some still consider “extremely restrictive” (the recent “not-so-open-source” release of GitButler provides us the latest example).

There are limits. There are implications. Not everything can stay in a public dataset, first of all personal data. This should be fine, and an accepted limitation. Unless the aim of the “Open Source AI Definition” is to provide a catch-all marketing buzzword to attach to every free-as-in-beer binary trained model.

Yes, but in this specific case it would mean that the open source community would miss the benefits of a copyright exception that is otherwise very well aligned with software freedom, by introducing a novel requirement to provide access to a separate asset when distributing a specific kind of software.

“Free-as-in-beer” is not related to this definition or this conversation at all.

Furthermore, a trained model is not binary code. It’s not even analogous to binary code. In fact, I already addressed what concerns the kind of representation of the AI system in my comments to the draft.

In an ML model, the mathematical and computer operations that will be performed are exactly defined, and exactly the same, whether the training data is known or not.
ML models are sometimes described as “black boxes”, but that’s regardless of whether you have the training data. Training data doesn’t clarify the “logic” behind the system’s decisions, which often, and actually most of the time for complex models, remains largely obscure.
My personal suspicion is that often there may not be anything which is both higher level and more intelligible than what’s clearly going on mathematically, but that’s beside the point.
The point is that, for the intelligibility of an ML system, training data is not even close to what source code is for a program. Source code describes the program in full detail in a human-readable language, while also being a form of the program itself. Training data is a separate asset, somewhat useful to understand the model, but not nearly as crucial nor as sufficient as source code.
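
As a minimal sketch of the point about exactly defined operations, consider a tiny logistic model written in plain Python. The weights, bias and function name are hypothetical, invented for illustration and not taken from any real model. Every arithmetic step of inference can be logged, and nothing in it refers to the training data:

```python
import math

# Hypothetical parameters of a tiny two-feature logistic model.
WEIGHTS = [0.5, -0.25]
BIAS = 0.1


def predict_with_trace(features):
    """Run inference and record every operation that was performed."""
    trace = []
    z = BIAS
    for w, x in zip(WEIGHTS, features):
        z += w * x
        trace.append(("multiply-add", w, x))
    probability = 1.0 / (1.0 + math.exp(-z))
    trace.append(("sigmoid", z))
    return probability, trace


probability, trace = predict_with_trace([2.0, 4.0])
for step in trace:
    print(step)
print("output:", probability)
```

The trace tells you what is computed, but not why the weights have the values they do, which is the sense in which the model remains a “black box” with or without the data.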

The same goes for modifications. The original training data can be useful, but it’s neither necessary nor something that would allow for direct modifications the same way that source code does.

We already have a meaning to the phrase “source code”, in FLOSS, for things that aren’t written in text nor compiled as binary. Introducing a novel condition would be bad in practice, because it would fundamentally exclude many crucial kinds of models and would be even worse in principle because it would be inconsistent with the previous “free”/“open” definitions.

I’d like to expand slightly on this point by @stefano, which I really agree with.

When distributing an open source program, especially one under a (strong) copyleft license, such as the GPL, one must be prepared to provide access to the full corresponding source code.
Linking to a repo by someone else, while distributing binaries independently yourself, may not be enough, because that repo may be taken down. This hasn’t gotten much better over time, especially for not very well-known digital assets. So you must have some way to provide source code to all to whom you give binaries.

If the training data of a program is made available publicly under an open license, and I think it should be whenever possible, it’s not just the original authors that need to pass that (possibly large) dataset on.

Everyone who wishes to include the model in an AI program and “certify” that program as open source would also have to make sure to provide access to the training data. This seems, to me, to be more of an obstacle to software freedom than an advantage.

This isn’t an abstract or pedantic concern. Datasets absolutely could be taken down. In practice a link to the data may be enough, but I don’t see how this can be implemented as a requirement in a way which isn’t sloppy.

This isn’t anywhere close to being my main reason for thinking that access to training data mustn’t be part of the open source AI definition, it’s just an afterthought. The pragmatic difficulties may be more than I originally thought.

@fontana I appreciate your distillation. I think of it as three pieces - software, model weights and parameters, and training data. That tracks with the current checklist of components, “code,” “model,” “data” and “other” (and I have a separate comment questioning whether there should be an “other”). So I have one more technical characteristic that is a distinction from software than you have.

That the software must meet the OSD is a no-brainer. For weights and parameters, I think the currently existing open source licenses conceptually are right, but there is a license enforcement problem with them that has to be solved if the weights and parameters aren’t copyrightable. But that’s not a definition problem, that’s an implementation problem.

This leaves the data, which is the hardest piece to me. If we assume that the training data is required for full transparency and reproducibility, what do we do about that? It has the same license enforcement problem as with weights/parameters, but what’s more challenging is that it also may contain information that it would be unlawful to provide access to (personally identifiable information, health data, financial data, copyrighted works). If we require access to data to call an AI system “open source,” open source AI may be a null set.

But data (like hardware) is typically outside the remit of the OSD - it just is a different animal. The EU AI act exclusion is for “AI models that are made accessible to the public under a free and open-source license whose parameters, including the weights, the information on the model architecture, and the information on model usage, are made publicly available.” (http://www.openfuture.eu/wp-content/uploads/2023/12/231206GPAI_Compromise_proposalv4.pdf) It doesn’t consider the training data as part of the system.

I’m leaning in the direction that data is just outside of scope. Why isn’t the AI system just the software to be applied to data and its resulting weights and parameters? Why does it have to include the data too? The data isn’t needed to study the weights and parameters (or to copy, modify and distribute them), only to reproduce them, but neither the Four Freedoms nor the OSD require that you be able to replicate the build. By analogy, the OSD doesn’t require that one must be able to faithfully replicate the built software. By requiring data, we’re elevating reproducibility above all else.

Moreover, saying that the data is required as part of the system will be to ratify that a particular product (a given foundational model, for example) is open source. But the OSI has never opined on whether a particular software program is open source.

So all of that is to say that I agree with you. I would toss data out, and the only problem left is to refine the OSD to include weights and parameters. There may be an implementation problem in coming up with a license that is enforceable for non-copyrightable content, but the OSI has never written licenses, only approved them. I have no doubt that it’s do-able, but it’s up to the community to write the license that accomplishes it. And the failure to accomplish it may still meet the OSD - if the content isn’t protected by a proprietary rights scheme, which is the hook for the license, everyone gets to use it as long as they have access.

But the original owner is likely not going to be able to distribute the data to the user in the first instance, because of the restrictions that @fontana identified.

What does that mean? That still sounds like raw data that has been processed, not the original data.

I make a similar point in a different thread, “Is the definition of "AI system" by the OECD too broad?” (post #5).

The main argument for including data in the definition is that it is part of the preferred form for modification of an AI system. We can debate (ideally, together with the actors who are actually creating and modifying AI systems) whether it is true or not that training data is part of that “preferred form”, but it is undeniable that if it is, then the definition of Open Source AI must include it.

(Otherwise, you would have the analogue of a binary executable in ELF format under an open source license, with no source code available for it.)

Ah, ok, thanks. If that’s the justification, then in my view creating an unachievable definition for an “open source AI” by requiring the data per se is to throw out the baby with the bathwater. I think it still may be weighting reproducibility too heavily. The premise seems to be that the weights and parameters will be too mysterious to figure out and adjust at that stage, so it requires complete retraining. So, for example, if the model inaccurately identifies Asian faces, is the current solution to retrain on a modified set of data or to just make adjustments to the weights and parameters?