Totally agree with @thesteve0.
Systems based on machine learning techniques are composed of two kinds of software: a virtual machine (with a specific architecture) that basically maps vectors to vectors, and a set of “weight” matrices that constitute the software executed by such a virtual machine (the “AI model”).
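To make the distinction concrete, here is a minimal sketch (a hypothetical single neuron, all names illustrative): the function is the “virtual machine”, fixed code anyone can read, while the numbers it executes are the “weights”, which tell you nothing about how they were produced.

```python
import math

# The "virtual machine": fixed, readable code that maps a vector to a vector.
def forward(weights, bias, x):
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return math.tanh(s)

# The "software" the machine executes: opaque numbers, not source code.
weights = [0.5, -0.2]
bias = 0.1

forward(weights, bias, [1.0, 2.0])  # ≈ 0.197, i.e. tanh(0.2)
```

Publishing `forward` is like publishing the VM; publishing `weights` is like publishing a compiled binary for it.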
The source code of the virtual machine can be open source, so that, given the proper compiler, we can recreate an exact copy of that software.
In the same way, the software executed by the virtual machine (usually referred to as “the AI model”) is encoded in a binary form that the specific machine can directly execute (the weight matrices). The source code of such a binary is composed of all the data required to recreate an exact copy of it (the weights). Such data includes the full dataset used, but also any random seed or input used during the process, such as the initial random values used to initialize an artificial neural network.
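A toy illustration of this point (a hypothetical one-variable “training” loop; names and numbers are mine, not from any real system): with the same data and the same seed you get a bit-identical copy of the model, while changing only the seed already gives you a different artifact.

```python
import random

# Hypothetical toy "training": fit y = a*x + b by stochastic gradient descent.
def train(points, seed):
    rng = random.Random(seed)                      # the "random value" above
    a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # random initialization
    for _ in range(50):
        for x, y in points:
            err = (a * x + b) - y
            a -= 0.01 * err * x                    # gradient step per point
            b -= 0.01 * err
    return a, b

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]        # samples of y = 2x + 1

run1 = train(data, seed=42)
run2 = train(data, seed=42)   # same data, same seed: an exact copy
run3 = train(data, seed=7)    # same data, different seed: a different model
```

`run1` and `run2` are bit-identical; `run3` is merely similar, which is exactly the “substantially equivalent” trap discussed below.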
While the weights are handy for modifying an AI system, they are in no way enough to study it.
So any system that does not provide the whole dataset required to recreate an exact copy of the model cannot be defined as open source.
Note that in an age of supply-chain attacks that leverage open source, the right to study the system also has huge practical security value: arXiv:2204.06974 showed that you can plant undetectable backdoors in machine learning models.
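A deliberately crude sketch of the general idea (this is NOT the cryptographic construction from that paper, just an illustration I made up): weights can encode behavior that casual inspection of the numbers will not reveal, which is why the weights alone do not let you study the system.

```python
# Toy linear classifier: sign of a dot product.
def classify(w, x):
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > 0 else 0

# The first two weights look ordinary; the huge third coordinate acts as a
# planted trigger on a feature that is normally zero.
w = [0.3, -0.5, 1e6]

classify(w, [-1.0, 0.5, 0.0])    # normal input: class decided honestly (0)
classify(w, [-1.0, 0.5, 1e-3])   # tiny trigger feature flips it to class 1
```

Without the training data and process, you cannot tell whether such behavior is an accident of training or a planted backdoor.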
Thus I suggest modifying the definition so that it reads:
Data information: Sufficiently detailed information about all the data used to train the system (including any random value used during the process), so that a skilled person can recreate an exact copy of the system using the same data. Data information shall be made available with licenses that comply with the Open Source Definition.
Being able to build a “substantially equivalent” system means being able to build not that system, but a different one. It would be like defining Google Chrome as “open source” just because we have access to Chromium source code.
When its training data cannot legally be shared, an AI system cannot be defined as “open source” even if all the other components comply with the open source definition, because you cannot study that system, only the components available under the open source license.
Such a system can be valuable, but not open source, even if the weights are available under an OSD-compliant license, because they encode an opaque binary for a specific architecture, not source code.
Let's properly call such models and systems “freeware” and build a definition of Open Source AI that is coherent with the Open Source one.