Training data access

Yes, but in this specific case it would mean that the open source community misses out on the benefits of a copyright exception otherwise very well aligned with software freedom, by introducing a novel requirement to provide access to a separate asset when distributing a particular kind of software.

“Free-as-in-beer” is not related to this definition or this conversation at all.

Furthermore, a trained model is not binary code. It’s not even analogous to binary code. In fact, I already addressed the concerns about the kind of representation of the AI system in my comments on the draft.

In an ML model, the mathematical and computational operations that will be performed are exactly defined, and exactly the same, whether the training data is known or not.
ML models are sometimes described as “black boxes”, but that’s true regardless of whether you have the training data. Training data doesn’t clarify the “logic” behind the system’s decisions, which often (indeed, most of the time for complex models) remains largely obscure.
My personal suspicion is that often there may not be anything which is both higher level and more intelligible than what’s plainly going on mathematically, but that’s beside the point.
The point is that, for the intelligibility of an ML system, training data is not even close to what source code is for a program. Source code describes the program in full detail in a human-readable language, while also being a form of the program itself. Training data is a separate asset, somewhat useful for understanding the model, but neither as crucial nor as sufficient as source code.
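To make the inference point concrete, here is a minimal sketch of a toy two-layer network. All the weights below are arbitrary stand-ins (not from any real model); the point is that every operation in the forward pass is fully specified by the weights and the architecture, and no training data appears anywhere in the computation:

```python
import numpy as np

# Arbitrary placeholder weights, standing in for parameters that some
# prior training run produced. The forward computation is fully defined
# by these alone.
W1 = np.array([[0.5, -0.2], [0.1, 0.8]])
b1 = np.array([0.0, 0.1])
W2 = np.array([[1.0], [-1.0]])
b2 = np.array([0.2])

def forward(x):
    """Every operation here is exact and reproducible given the weights.
    The training data is nowhere required."""
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2

x = np.array([1.0, 2.0])
y = forward(x)  # identical result on every run, data in hand or not
```

Whether or not you ever see the dataset that produced `W1`, `b1`, `W2`, `b2`, this computation is the same, down to the last bit.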

The same goes for modifications. The original training data can be useful, but it’s neither necessary nor something that enables direct modification the way source code does.
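This too can be illustrated with a toy sketch: fine-tuning released weights on new data, with the original training set nowhere in sight. The weights and data below are made up for illustration:

```python
import numpy as np

# Hypothetical released weights for a tiny linear model (a stand-in for
# a real pretrained model's parameters).
w = np.array([0.5, -0.3])

# New data supplied by whoever is making the modification; the original
# training set is not needed at any point.
X_new = np.array([[1.0, 0.0], [0.0, 1.0]])
y_new = np.array([1.0, 1.0])

def finetune_step(w, X, y, lr=0.1):
    """One gradient-descent step on mean squared error, applied directly
    to the released weights."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

w_modified = finetune_step(w, X_new, y_new)
```

This is how models are in fact modified in practice: you start from the weights, not from a rerun of the original training.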

We already have a meaning for the phrase “source code”, in FLOSS, for things that aren’t written as text or compiled to binary. Introducing a novel condition would be bad in practice, because it would fundamentally exclude many crucial kinds of models, and even worse in principle, because it would be inconsistent with the previous “free”/“open” definitions.

I’d like to expand slightly on this point by @stefano, which I really agree with.

When distributing an open source program, especially one under a (strong) copyleft license, such as the GPL, one must be prepared to provide access to the full corresponding source code.
Linking to a repo maintained by someone else, while distributing binaries independently yourself, may not be enough, because that repo may be taken down. This hasn’t gotten much better over time, especially for less well-known digital assets. So you must have some way to provide source code to everyone to whom you give binaries.

If the training data of a program is made available publicly under an open license, and I think it should be whenever possible, it’s not just the original authors who need to pass that (possibly large) dataset on.

Everyone who wishes to include the model in an AI program and “certify” that program as open source would also have to make sure to provide access to the training data. This seems, to me, to be more of an obstacle to software freedom than an advantage.

This isn’t an abstract or pedantic concern. Datasets absolutely could be taken down. In practice a link to the data may be enough, but I don’t see how this can be implemented as a requirement in a way which isn’t sloppy.

This isn’t anywhere close to being my main reason for thinking that access to training data mustn’t be part of the open source AI definition; it’s just an afterthought. The pragmatic difficulties may be greater than I originally thought.
