What does "preferred form" really mean in Open Source?

The latest OSI blog about the OSAID post borrows from the GNU GPL the concept of preferred form for making modifications, “a cornerstone of Open Source licensing”.

The article than goes on describing the (flawed) process that led to the 0.0.9 draft based on the conclusion that

data is essential for understanding and studying the system, it’s not the “preferred form” for making modifications.

Indeed, you can make a few limited modifications to an AI system through fine-tuning (somewhat like you can tweak configuration settings in your browser) but several other modifications are completely forbidden without the data.
For example, without the data, you cannot reduce the vocabulary of a LLM.

And given the frequent breakthrough in the field, we can easily expect forthcoming techniques whose model breaks if fine tuned. Yet even with such techniques, the availability of source data would enable modification of the system.

Indeed even during the development of a single new AI model, builders attempt several alternative modifications of the data (filtering, deduplications, normalization and so on) to get modified versions of the model to be compared in the cross-validation phase.

So, if data is NOT part of the “preferred form” for making modifications to an AI, why AI builders keep modifying the data themselves?

Because they prefer to modify data as well.

Thus, no definition of “preferred form” can exclude anything that AI builders prefer to use while modifying their own AI systems.

2 Likes

Meanwhile, my straw poll closed (n=15) with around 9 in 10 selecting Data over Model:

What is the “preferred form” in which a practitioner would modify a model?
Data (i.e., training datasets) 87%
Model (i.e., weights & biases) 13%

Sam

2 Likes

With “model weights” being the “form of modification” is like: I gave you a pre-compiled binary executable program, under the MIT license, but I don’t provide you the code from which I created the binary executable. That said, you have the full freedom to hexdump it, hexedit it, and redistribute your hexedit outcome, etc. Have fun with the binary blob.

Simply makes no sense to me.

5 Likes