GPL2 kernel source as an MIT-licensed model: Is this really open?

As discussed with @stefano and @spotaws on LinkedIn, I am deeply concerned that we are ignoring the legal status of the most important dependency for this draft - the training data.

We would never support a standard that openly ignored transitive dependencies in software. We’ve spent decades defending this common-sense understanding, updating licenses for changes in application architectures, and litigating where necessary to ensure that the commons is not abused.

The current draft is equivalent to Creative Commons recommending that AGPL terms can be freely ignored because of the friction they impose on innovation, or suggesting that de minimis use of “just a few functions accidentally” allows you to ignore the GPL3 terms.

Toy Model

To underline the point, I have trained a simple model designed to show why the current draft must change.

This model, a 5M-parameter transformer that memorizes and emits the source code of Linux kernel v1.0 from 1994, is fully compliant with the OSAID 0.0.8 draft.

The entire pipeline, from data collection and preprocessing to training, is on GitHub under MIT: mjbommar/laam, “linux as a model (is it MIT or GPL2?)”.

The model weights are available under MIT on Hugging Face.

But the model is just a compressed representation of the GPL2 kernel.

Can any encoding, lossless or otherwise, trump the plain language spirit and letter of license terms?

If I drop 5% of files and compress them into a .tgz, can I relicense Ghostscript sources under Apache 2?

This cannot possibly be the kind of thing the OSI would support.
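To make the compression analogy concrete, here is a minimal sketch using Python's standard `zlib` module. The sample bytes and the helper names `encode` and `decode` are hypothetical stand-ins for illustration and are not part of the laam repository. The point is only that an encoded blob can look nothing like the source and still reproduce it byte for byte.

```python
import zlib

# Hypothetical stand-in for a GPL2-licensed source file; any bytes would do.
SOURCE = b"/* licensed under GPL2 */\nint main(void)\n{\n\treturn 0;\n}\n" * 4


def encode(data: bytes) -> bytes:
    """Losslessly compress the input; the result is a different byte string."""
    return zlib.compress(data, level=9)


def decode(blob: bytes) -> bytes:
    """Recover the exact original bytes from the compressed blob."""
    return zlib.decompress(blob)


if __name__ == "__main__":
    blob = encode(SOURCE)
    print("encoded form differs from source:", blob != SOURCE)
    print("decoded form equals source:", decode(blob) == SOURCE)
```

A model that memorizes its training text plays the role of `encode` here, and prompting it plays the role of `decode`. Changing the container does not change what comes out.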

More on Memorization/Distribution

While this toy example is deliberately exaggerated to simplify the point, empirical research on real models shows that LLMs do, unsurprisingly, memorize and emit copies of their input data. That is, they are still distributing content, even if it is difficult to predict when.

  1. Emergent and Predictable Memorization in Large Language Models. Biderman et al., 2023.

  2. Quantifying Memorization Across Neural Language Models. Carlini et al., 2022.

  3. Patronus CopyrightCatcher
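For readers who want to see what a memorization check looks like in practice, here is a rough sketch in the spirit of prefix-continuation tests: prompt with a prefix taken from a known document and check whether the output matches the true continuation exactly. Everything here is an assumption for illustration. The `generate` callable is a hypothetical interface, the windows are measured in characters rather than tokens, and the lookup-table "model" merely stands in for a real LLM. It is not the exact procedure of any of the works listed above.

```python
from typing import Callable

# Hypothetical interface: generate(prompt, n) returns up to n characters of continuation.
Generate = Callable[[str, int], str]


def is_extractable(generate: Generate, document: str, start: int,
                   prefix_len: int, suffix_len: int) -> bool:
    """True if prompting with a prefix from the document yields its exact continuation."""
    prefix = document[start:start + prefix_len]
    expected = document[start + prefix_len:start + prefix_len + suffix_len]
    if len(prefix) < prefix_len or len(expected) < suffix_len:
        raise ValueError("window exceeds document length")
    return generate(prefix, suffix_len) == expected


def extraction_rate(generate: Generate, document: str,
                    prefix_len: int, suffix_len: int, stride: int) -> float:
    """Fraction of sampled windows whose continuation is reproduced verbatim."""
    starts = range(0, len(document) - prefix_len - suffix_len + 1, stride)
    if len(starts) == 0:
        return 0.0
    hits = sum(is_extractable(generate, document, s, prefix_len, suffix_len)
               for s in starts)
    return hits / len(starts)


def make_memorizing_model(memorized: str) -> Generate:
    """Toy stand-in for a model: looks the prompt up in text it has 'memorized'."""
    def generate(prompt: str, n: int) -> str:
        idx = memorized.find(prompt)
        if idx < 0:
            return ""
        begin = idx + len(prompt)
        return memorized[begin:begin + n]
    return generate


if __name__ == "__main__":
    doc = "abcdefghijklmnopqrstuvwxyz0123456789"
    model = make_memorizing_model(doc)
    print(extraction_rate(model, doc, prefix_len=5, suffix_len=5, stride=3))
```

With a real model, `generate` would wrap a greedy or low-temperature decoding call, and a nonzero rate on a licensed document would mean verbatim text is coming back out.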

Recommendation

Please, as others including even @pchestek have expressed, include common-sense requirements for the most important aspect of these models - the input training data.

While I understand that historically such non-software works have been the purview of others, there is an obvious partner in Creative Commons and their ecosystem, who have spent decades on a parallel path. I discussed the issue with Lessig himself last year, and even he is concerned about a future where the open information commons collapses.

If concerns around training data requirements are focused on the asymmetry with oligopolists, then it has never been more important to work with organizations who support creators of other types, e.g., Creative Commons, to protect the property rights and coordination mechanisms that we rely on.

Defecting from open here will end like all Prisoner’s dilemma games do…


No one seems to want to engage with the toy example.

What about something like gpt-4o doing it?

Why did we spend decades protecting copyleft to let something like this happen now?

This prompt-and-parameter setup is very similar to what happens inside many applications using AI today. Low-temperature look-ahead use cases are already here inside many apps.

Original: linux/fs/ext4/acl.c at master · torvalds/linux · GitHub
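On why low temperature matters here: temperature rescales the model's next-token scores before sampling, and as it approaches zero nearly all probability lands on the single highest-scoring token. If that token is the memorized continuation, the output becomes effectively deterministic. A minimal sketch, with made-up logits and a hypothetical helper name, not taken from any particular model or API:

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Made-up scores for three candidate next tokens; index 0 is the "memorized" one.
    logits = [2.0, 1.0, 0.5]
    for t in (1.0, 0.5, 0.1):
        probs = softmax_with_temperature(logits, t)
        print(t, [round(p, 4) for p in probs])
```

At a temperature of 1.0 the alternatives still receive meaningful probability. At 0.1 the top token takes almost all of it, which is the regime many completion-style applications run in.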