The Open Source Definition is a unified umbrella that intentionally makes room for diversity within the whole: it includes permissive and copyleft licenses, licenses that permit linking with proprietary code and those that don’t, licenses that require a name/version change for proprietary versions and those that don’t, and so on. This unified diversity is a crucial part of the success of the OSD.
Open Source software also exists within the context of other open movements: open hardware and open silicon lower down in the stack, and open data and open content higher up the stack. While software is the primary mission of the OSI, it’s important for us to be mindful of these other open movements, be careful not to undermine their success, and acknowledge the compounding benefits of openness at multiple levels of the stack, such as combining open source software with open data.
With that in mind, we need to be clear that an AI model is not purely software: it is the result of applying an algorithm (source code) to a specific training data set (source data). This means it is a compiled form of both source code and source data. Some Open Source AI developers won’t mind if their AI source code is “linked” with proprietary source data, similar to the way that some Open Source software developers don’t mind if their source code is linked with or embedded within proprietary software. But, in keeping with the philosophy of the OSD, we need to embrace the fact that some developers do care, and do not want their Open Source AI code to be linked with proprietary source data, similar to the way that some Open Source software developers do not want their source code to be linked to or embedded within proprietary software.
There are clear practical benefits to combining Open Source AI with open training data: it grants a more comprehensive right to study the AI system and understand how it works; it grants a more comprehensive right to modify the AI model by modifying both the source data and source code to train a new model; it grants a more comprehensive right to share more of the Open Source AI system; and it enables additional beneficial features when using the system, such as checking whether AI generative output includes near copies of the training data (which is a perfectly reasonable and statistically likely output from generative AI) and whether the license, terms, or preference signals of the training data are compatible with the intended use of the generated output.
The OSD explicitly includes some language about what an open source license “may” restrict or require, together with language about what it “must” restrict and allow. Those clearly defined options within the text of the OSD make the unified diversity of the open source community possible. If we can do the same with the Open Source AI Definition, and clearly articulate the principles of open training data while also allowing for diversity, it will better support the long-term success of Open Source AI, as well as its developers, deployers, and users. TBH, the 0.0.9 version is very close to doing this already; you could even say that a diversity of approaches to training data is implied within the definition, and more obvious in the FAQ and checklist. But this diversity isn’t as clearly stated in the Open Source AI Definition as it is in the OSD. If we’ve learned anything from decades of poring over the OSD, as software evolved for new use cases in new problem domains and new licenses emerged to address new challenges, it is that an ounce of clarity in the definition can save us a world of pain over the coming decades.
I propose two small text changes to the 0.0.9 draft of the Open Source AI Definition:
- Update the first bullet point under the section heading “Preferred form to make modifications to machine-learning systems” to explicitly make room (within a unified Open Source AI Definition) for the diversity of approaches to training data that already exist in the open source community, and clearly articulate the principle that makes an Open Source AI system “open” when the training data is not (namely, that it can be retrained from scratch):
- Data: Sufficiently detailed information about the source data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available with licenses that comply with the Open Source Definition. In addition, the system may require that all source data must be made available under licenses that comply with the Open Source Definition. If the system permits proprietary source data, it must grant the right and provide the means to modify the system to use only source data that complies with the Open Source Definition.
- Update the first bullet point under the section heading “Open Source models and Open Source weights” to more accurately capture the role of data in an AI model:
- An AI model is the output of an algorithm (source code) applied to a training data set (source data). It consists of the model architecture, model parameters (including weights) and inference code for running the model.