Training data access

Aspie96 · February 10, 2024, 5:13am

In my view, training data is not and cannot be “part of the preferred form” for modification because it’s not part of the model at all, regardless of how it’s represented.

It is a useful asset for modifications. So is documentation (which is possibly even more important than training data) and many other things which absolutely should be provided but aren’t part of the program, and whether the program is open source doesn’t hinge on them.

To what @pchestek said, I’d like to add something. A mostly unspoken, but rather crucial property of open source, in my view, has always been that practically all (kinds of) software can be open source if the author wishes it to be. When a program is not open source, the author took a (completely legitimate) decision to restrict software freedom, in some way. Unfortunate exceptions to this exist due to patents and DRM laws, but it’s true enough that practically any kind of user program can exist as open source.

Extending the definition of what (a form of) the model is to include training data would mean that this would no longer be true and only models which don’t qualify as “open source” could compete in many areas.

But even complete retraining won’t tell you the high-level reasons for many behaviors, if there even are any at all. It’s quite useful, but not more (and possibly less) than documentation.

The fact that these models are fuzzy statistical systems with often mysterious behaviors is really part of their nature, no one’s fault and not a matter of software freedom.

zack · February 10, 2024, 5:38am

If we look at the various subsequent versions of the big models out there that strive to be open, every new version is a complete retraining with changes all over the place: in the dataset (usually bigger than before, but also more cleaned or differently cleaned), in the training pipeline, in the model architecture, etc. Of course, the applied changes build upon the previous version rather than starting from scratch (although that sometimes happens too).

This is for me good evidence of the existence of relevant use cases where the “preferred form of modification” does include the training dataset.

On the other hand, we see this pattern of modification mainly by the same teams who published the previous version of a given model. The kind of modifications we tend to see by third parties (w.r.t. the initial publishers of a given model) happen more at the level of fine-tuning weights, hence without needing access to the original training set.

To my eyes, this means that we have different patterns of model modifications, some that require access to the training dataset, some that don’t. But note that the modification patterns that do need access to the training dataset are not just to reach some very high bar of “replicability” (which is nice to have, but should not be a requirement to be open — I agree with you on this), they are also simply to create new/better versions of an existing model.

shujisado · February 11, 2024, 5:34am

I believe that many of the current AI model forks rely on retraining. Therefore, it is important that training data be open and free.

However, I feel that if we too readily require data freedom in the OSAID definition, there is a risk of some discrepancy with existing open source. Yes, this may be an excessive concern. However, it would be less worrisome if there is at least a clause in the OSAID text that gives priority to the OSD in areas where the OSD is applicable.

Also, just out of curiosity, the term Open Data already exists as a definition to indicate that data is freely available; has OSI ever discussed this with the Open Knowledge Foundation?

fontana · February 11, 2024, 6:17am

There’s a nuance here that may be important. The “preferred form” language in OSD 2 was taken from GPLv2 section 3 (and a number of open source licenses have also borrowed the “preferred form” formulation). But GPLv2 itself carves out an exception to the “preferred form” requirement, in the form of the system library exception. From the GPLv2 point of view, there are circumstances where the preferred form (at least arguably) includes some code that does not have to be provided in source code form – historically, situations where it was not possible to obtain free source code for proprietary operating system components.

Implicitly, in recognizing that GPLv2 is a quintessential example of an open source license, and in using the “preferred form” language in OSD 2, the OSI recognizes pragmatic limits to the preferred form requirement - or so I would argue.

While I wouldn’t overstate the analogy to the system library exception, there is at least one kind of commonality, in that in general it is not possible to provide training data for (e.g.) large language models under libre terms, or under any terms.

fontana · February 11, 2024, 6:34am

I completely agree with this. The practical effect of the OSI requiring availability of training data is either that (a) there will be no open source models, from the OSI’s standpoint, except possibly some relatively insignificant ones, or (b) the machine learning community will continue its current course of using “open source” to mean something completely unanchored to the OSD and the values it embodies. Or both.

shujisado · February 11, 2024, 6:37am

Aspie96 has given a very good opinion so there doesn’t seem to be much for me to write.

I was just wondering a little, how many countries are clear that data for training is not protected by copyright at this time?
Under Japanese copyright law, it is clearly stipulated that the act of using copyrighted material for machine learning cannot be restricted by copyright. Perhaps in the US, it is more likely to be determined as fair use. I am not sure about other countries.

For reference, an excerpt from the Japanese Copyright Law is placed below.

Article 30-4
It is permissible to exploit a work, in any way and to the extent considered necessary, in any of the following cases, or in any other case in which it is not a person’s purpose to personally enjoy or cause another person to enjoy the thoughts or sentiments expressed in that work; provided, however, that this does not apply if the action would unreasonably prejudice the interests of the copyright owner in light of the nature or purpose of the work or the circumstances of its exploitation:

(ii) if it is done for use in data analysis (meaning the extraction, comparison, classification, or other statistical analysis of the constituent language, sounds, images, or other elemental data from a large number of works or a large volume of other such data; the same applies in Article 47-5, paragraph (1), item (ii));

fontana · February 11, 2024, 6:41am

They assert that the dataset is “open”, but apparently the licenses of the dataset are these: AI2 ImpACT Licenses, at least some of which are not libre. And that’s not to mention the items in the dataset. So this looks like the typical loose use of “open” that is currently prevalent in the machine learning community.

zack · February 11, 2024, 10:28am

@fontana indeed the system library exception carves out relevant code from the notion of “preferred form” in the GPL case, but I don’t see how that helps in the training data case (maybe you didn’t mean it to help; I’m pointing this out just in case I’m missing your point here).

The situation of training data seems to me to be quite the opposite of system libraries. In the system library case, whatever falls under that notion is assumed to be widely available on any system where the free software in question is meant to be used. It is also implicitly assumed that you can exchange one system library for another, with no practical impact.

Neither of these is true for training data: it is only available to system producers, and you can definitely not exchange one for another hoping to obtain similar results.

Aspie96 · February 11, 2024, 6:54pm

I think A would almost definitely happen. B may or may not.

There is something else I’d like to point out, however.

Normally, in open source, the licensor of the source code and the licensor of object code are the same person. This is because they are the same asset, the same program and have the same authors and rightholders.

Extending the concept of “source code” to include training data would also create a discrepancy regarding this.

Alek_Tarkowski · February 13, 2024, 8:38am

@stefano, I hope that it’s OK to reopen this conversation.

You wrote a while ago:

Can you write a bit more about the Llama 2 review exercise? I understand that the WG was considering specifically access to testing data. Are they also considering access to training data?

Alek_Tarkowski · February 13, 2024, 8:48am

Hello @Aspie96, I think that this EU rule does not necessarily conflict with a data sharing obligation. The Article concerns retention of “Reproductions and extractions”. I can imagine a data sharing obligation that only requires information / metadata to be released.

In general, any data-sharing rules should distinguish between an obligation to make data available and an obligation to disclose the (training) data. The first requirement is in many scenarios hard to meet, for example if the model was trained with PETs (@stefano mentioned this scenario). But the second requirement can always be met, at some level of granularity / abstraction.

It’s of course then an important question whether disclosure is enough.

This is where I think the conversation about data could benefit from a clear technical perspective that considers to what extent access to data is needed, in AI development practice, to meet the four freedoms.

The other perspective is of course ethical / normative, and from this perspective the broadest possible access would be recommended (“as open as possible, as closed as necessary”, mainly to preserve different data rights).

Alek_Tarkowski · February 13, 2024, 9:00am

This is a very interesting idea, to set a higher / additional standard. Especially since, as you mention, open source and data access are orthogonal issues (though the connection is, at least for me, still not spelled out clearly enough, especially in terms of the implications of data access for securing the freedoms to use components of the AI system).

This orthogonality suggests that this higher standard should combine open access standards with data standards (for example, through the Open Definition; some conceptual work done by the Open Data Charter also fits here).

pchestek · February 14, 2024, 3:30am

Would the software code that is used to create the behaviors inform this? The code is going to be under an open source license.

Aspie96 · February 14, 2024, 5:04am

Metadata is data about data and the original data may go offline tomorrow, so metadata doesn’t make the system any closer to any “open source” label that requires sharing of data.

The question is not (just) whether it can be met, but whether it should be part of an “open source AI definition”. I see no logic that would require metadata about (possibly proprietary) training data be a requirement for anything to be open source.

Ultimately open source software is useful, but what’s useful and what’s open source are two very different questions. I think we all agree training data (or, at least, information about it) is quite useful. So is documentation in general, but it’s a separate question from what qualifies a system as open source. A lot of open source software has either no or horrible documentation.

This document isn’t really a description of what ought to be done or how models ought to be released, but rather of what can and what mustn’t be referred to as “open source” (and, I suppose, remain compatible with future copyleft licenses).

Aspie96 · February 14, 2024, 5:09am

In many cases, not very much.

The behavior at a low level and at a mathematical level, with various levels of granularity, is entirely well-understood and completely clear just from the model architecture and the inference code. No training code needed.

When it’s said that certain models are “black boxes” this is in the sense that we often don’t know why the operations of the model lead to certain results, or why models tend to behave in a certain way, or even what the “meaning” of internal parameters is, if there is one. People do investigate the answer to these questions, but it’s often quite hard and, personally, I’m not at all confident that an intelligible answer even exists in all circumstances. But, if there is, I don’t see it being in the training code, regardless of its license.
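As a rough illustration of the point above (a minimal sketch, not taken from any real model): the toy network below is hypothetical, and its weights `W1`, `B1`, `W2`, `B2` are made-up numbers. Every arithmetic step of inference is visible from the architecture and the inference code alone, yet nothing in the code says what any individual weight "means".

```python
import math

# Hypothetical weights for a tiny 2-2-1 network. The values are invented
# for illustration only and do not come from any trained model.
W1 = [[0.5, -1.0], [1.5, 0.25]]  # hidden-layer weights, one row per hidden unit
B1 = [0.0, -0.5]                 # hidden-layer biases
W2 = [1.0, -2.0]                 # output-layer weights
B2 = 0.1                         # output-layer bias


def relu(x):
    """Rectified linear unit: negative inputs become zero."""
    return max(0.0, x)


def forward(x):
    """Run inference: every operation is explicit and fully specified."""
    hidden = [
        relu(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, B1)
    ]
    logit = sum(w * h for w, h in zip(W2, hidden)) + B2
    # Sigmoid squashes the logit into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    print(forward([1.0, 2.0]))
```

The low-level behavior is fully determined by this code and the numbers in it; why a particular weight has the value it does is a separate question that neither the inference code nor, arguably, the training code answers.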

stefano · February 14, 2024, 10:42am

Yes, we’re in touch with people at the OKN. The main issue, though, is not the definition of open data.

The problem we’re discussing is caused by the nature of ML systems: these require data to create something new that is based on data but is not data anymore. The AI/ML system is a brand new artifact. This use case for data is new. The open data collections you see out there don’t seem to have thought of this use case either, except for a few notable exceptions.

fontana · February 15, 2024, 7:50pm

One of the reasons that the system library exception was originally created (AFAIK) is that it was not possible to obtain and distribute the source code of proprietary system libraries. That seems to be somewhat analogous to the situation with training data. I admit it is not a perfect analogy by any stretch. But it illustrates how this question of “how much stuff do I have to provide” has been interpreted pragmatically.

shujisado · February 17, 2024, 6:42am

I understand what you are saying.

However, under Japanese law, there are no intellectual property rights in the dataset for training, nor in the data part of the AI model. Both are merely data. Thus, we were concerned that OKF would create a definition based on a different view than ours.

Aspie96 · February 17, 2024, 6:05pm

One issue about laws is that laws change, not just from place to place, but also through time.

However we choose to define anything, we must do so with some good level of foresight, so that definitions can remain as consistent as possible over time.

The same applies, IMO, to the text of licenses. I don’t think all open source licenses do it, but it’s something to keep in mind.

lumin · February 26, 2024, 5:05am

Exactly. A pre-trained neural network whose dataset is not publicly available is still factually controlled by its creator. It does not encourage people to make improvements over the original pre-trained neural network. Without the original dataset, when the community tries to make improvements through alternative datasets, there is no controlled experiment from which we can tell whether there is really an improvement. This is critical to academia. Open-source is important for academia.

People cannot exclude this requirement just because “training it again or making modification in its architectures etc” is “too high-end / power-user-specific / advanced” for a generic user. The open-source definition is designed for people including machine learning engineers, researchers, and scientists. Do not exclude that minority from the potential audience of the OSAID.

If OSAID does not require the original training dataset, I’d say it will become a historical mistake.