Training data access

pchestek · February 9, 2024, 5:40am

@fontana I appreciate your distillation. I think of it as three pieces - software, model weights and parameters, and training data. That tracks with the current checklist of components, “code,” “model,” “data” and “other” (and I have a separate comment questioning whether there should be an “other”). So I count one more technical characteristic that distinguishes these systems from software than you do.

That the software must meet the OSD is a no-brainer. For weights and parameters, I think the currently existing open source licenses conceptually are right, but there is a license enforcement problem with them that has to be solved if the weights and parameters aren’t copyrightable. But that’s not a definition problem, that’s an implementation problem.

This leaves the data, which is the hardest piece to me. If we assume that the training data is required for full transparency and reproducibility, what do we do about that? It has the same license enforcement problem as with weights/parameters, but what’s more challenging is that it also may contain information that it would be unlawful to provide access to (personally identifiable information, health data, financial data, copyrighted works). If we require access to data to call an AI system “open source,” open source AI may be a null set.

But data (like hardware) is typically outside the remit of the OSD - it just is a different animal. The EU AI Act exclusion is for “AI models that are made accessible to the public under a free and open-source license whose parameters, including the weights, the information on the model architecture, and the information on model usage, are made publicly available.” http://www.openfuture.eu/wp-content/uploads/2023/12/231206GPAI_Compromise_proposalv4.pdf It doesn’t consider the training data as part of the system.

I’m leaning in the direction that data is just outside of scope. Why isn’t the AI system just the software to be applied to data and its resulting weights and parameters? Why does it have to include the data too? The data isn’t needed to study the weights and parameters (or to copy, modify and distribute them), only to reproduce them, but neither the Four Freedoms nor the OSD require that you be able to replicate the build. By analogy, the OSD doesn’t require that one must be able to faithfully replicate the built software. By requiring data, we’re elevating reproducibility above all else.

Moreover, saying that the data is required as part of the system would amount to ratifying that a particular product (a given foundational model, for example) is open source. But the OSI has never opined on whether a particular software program is open source.

So all of that is to say that I agree with you. I would toss data out, and the only problem left is to refine the OSD to include weights and parameters. There may be an implementation problem in coming up with a license that is enforceable for non-copyrightable content, but the OSI has never written licenses, only approved them. I have no doubt that it’s doable, but it’s up to the community to write the license that accomplishes it. And a failure to accomplish it may still meet the OSD - if the content isn’t protected by a proprietary rights scheme, which is the hook for the license, everyone gets to use it as long as they have access.