Reading about the various legal and technical issues that led to the current draft, and about the risks that several people raised about the vagueness of its terms, I realized that the Data Information definition conflates two orthogonal dimensions of the problem.
On one hand we have what @arandal describes as “source data” that “is to trained model weights as software source code is to binary executable” and thus must be available to grant both the freedom to study and the freedom to modify the system¹.
When designing the requirements about source data, we must take the various legal issues into account, but without opening loopholes that would allow systems that do not really grant all of the 4 freedoms to pass as “Open Source AI”.
On the other hand, we have the information about the process that is required to turn the source data into the weights of the model.
Such information is always under the developer’s control and can easily be shared under a license matching the Open Source Definition.
Thus I propose to modify the draft by clarifying the different aspects in the “Preferred form” paragraphs:
The preferred form of making modifications to a machine-learning system is:
Source Data: All the data used to train the system must be available for use by the public either under licenses that comply with the Open Source Definition or under the same legal terms that allowed their use to train the system in the first place, so that a skilled person can recreate an exact copy of the system using the same data.
- For example, if used, this would include the training data sets used and the checksums that guarantee their integrity, the values obtained from random sources during the training process, and any other data required to compute the weights distributed with the system.
Processing Information: Sufficiently detailed information about the process used to train the system from the source data, so that a skilled person can recreate an exact copy of the system using the same data. Processing information shall be made available with licenses that comply with the Open Source Definition.
- For example, if used, this would include the training methodologies and techniques, how to retrieve the source data and check their integrity, the labeling procedures and data cleaning methodologies, and any other information required to compute the weights distributed with the system from the source data.
Source Code: The source code used to train and run the system, made available with licenses that comply with the Open Source Definition.
- For example, if used, this would include code used for pre-processing data, code used for training, validation and testing, supporting libraries like tokenizers and hyperparameter search code, inference code, and model architecture.
Weights: The model weights and parameters, made available with licenses that comply with the Open Source Definition.
- For example, this might include checkpoints from key intermediate stages of training as well as the final optimizer state.
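To make the “exact copy” requirement concrete, here is a minimal sketch, in Python, of how a skilled person could check that the source data they retrieved matches the checksums published with the system. The file and manifest names are hypothetical illustrations, not part of the proposal:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_artifacts(manifest: dict[str, str], root: Path) -> bool:
    """Check every file listed in a published checksum manifest.

    `manifest` maps relative file names (e.g. training data shards)
    to their expected SHA-256 digests.
    """
    return all(sha256_of(root / name) == digest
               for name, digest in manifest.items())
```

Any mismatch means the distributed dataset is not the one the weights were computed from, which is exactly the kind of silent substitution the proposal aims to make detectable.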
Such a formulation would have several advantages:
- it grants all of the 4 freedoms, with full transparency for users and researchers;
- by requiring a reproducible training process, it would reduce security risks for users of Open Source AI, since injecting undetectable backdoors would be riskier for the developers (compared to injecting them into opaque and unreproducible weights);
- it does not impose the distribution of the source data, as long as they are available to the public under the same terms that allowed their use in the training process for the original developers;
- it does not mandate specific licensing terms for the source data, while still allowing their distribution under an open source license;
- it requires an open source license for the processing information;
- it works well with any legal exemption or exception intended to support AI research in the open, without opening loopholes that would expose such regulations to abuse by bad actors;
- it avoids systemic and institutionalized open-washing by bad actors who could pretend to be distributing an Open Source AI with the corresponding source data while actually distributing a different dataset (since the documented process must allow a skilled person to compute the exact model weights from the source data);
- it does not require “OSI-approved” licenses but “licenses that comply with the Open Source Definition”, avoiding a single point of failure and a legal burden on OSI.
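The security argument above hinges on recording every source of randomness used during training. A toy sketch (a deliberately simplified stand-in, not any real training loop) of how a recorded seed makes a stochastic process bit-for-bit reproducible:

```python
import random

def train_toy_model(data: list[float], seed: int, steps: int = 100) -> list[float]:
    """Toy 'training': random-perturbation search for a single weight.

    The seed is one of the 'values obtained from random sources' the
    proposal requires to be shared: with it, anyone can recompute the
    exact same weight; without it, the run is unreproducible.
    """
    rng = random.Random(seed)        # all randomness flows from the recorded seed
    target = sum(data) / len(data)   # the toy objective: match the data mean
    w = 0.0
    for _ in range(steps):
        candidate = w + rng.gauss(0, 0.1)
        # accept the perturbation only if it reduces the squared error
        if (candidate - target) ** 2 < (w - target) ** 2:
            w = candidate
    return [w]
```

Running this twice with the same data and seed yields identical weights, so a third party can confirm that the distributed weights really come from the distributed data and nothing else.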
As far as I can see, such a formulation would address every legal concern raised so far, including the ones described by @smaffulli and @Senficon, without opening huge loopholes that could be exploited to pass a system off as “open source” without really granting the 4 freedoms to the users.
Of course, systems that cannot legally share their source data would not qualify as Open Source AI, and that’s fine: we will have valuable “weights-available” AI systems just like we have valuable “source-available” software.
On the other hand, developers would have a clear incentive to grant all of the 4 freedoms to qualify their systems as Open Source AI, without the toxic competition of open-washers who would benefit from the ambiguities in the current draft.
¹ Granting the “freedom to fine-tune” cannot be enough to qualify as Open Source AI, just as granting the “freedom to configure” cannot be enough to qualify as Open Source Software.