Welcome diverse approaches to training data within a unified Open Source AI Definition

The Open Source Definition is a unified umbrella that intentionally makes room for diversity within the whole: it includes permissive and copyleft licenses, licenses that permit linking with proprietary code and those that don’t, licenses that require a name/version change for proprietary versions and those that don’t, etc… This unified diversity is a crucial part of the success of the OSD.

Open Source software also exists within the context of other open movements: open hardware and open silicon lower down in the stack, and open data and open content higher up the stack. While software is the primary mission of the OSI, it’s important for us to be mindful of these other open movements, be careful not to undermine their success, and acknowledge the compounding benefits of openness at multiple levels of the stack, such as combining open source software with open data.

With that in mind, we need to be clear that an AI model is not purely software; it is the result of applying an algorithm (source code) to a specific training data set (source data). In other words, it is a compiled form of both source code and source data. Some Open Source AI developers won’t mind if their AI source code is “linked” with proprietary source data, much as some Open Source software developers don’t mind if their source code is linked with or embedded within proprietary software. But, in keeping with the philosophy of the OSD, we need to embrace the fact that some developers do care, and do not want their Open Source AI code to be linked with proprietary source data, just as some Open Source software developers do not want their source code linked to or embedded within proprietary software.

There are clear practical benefits to combining Open Source AI with open training data:
  • it grants a more comprehensive right to study the AI system and understand how it works;
  • it grants a more comprehensive right to modify the AI model, by modifying both the source data and the source code to train a new model;
  • it grants a more comprehensive right to share more of the Open Source AI system; and
  • it enables additional beneficial features when using the system, such as checking whether generative output includes near copies of the training data (a perfectly reasonable and statistically likely output from generative AI) and whether the license, terms, or preference signals of the training data are compatible with the intended use of the generated output.
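To make the last of those features concrete: checking generated output for near copies of training data can be approximated with a simple word n-gram overlap score. This is only a minimal sketch for illustration (function names and the toy strings are mine, and real systems use far more sophisticated deduplication and attribution methods), but it shows why open training data makes such a check possible at all:

```python
def ngrams(text, n=8):
    """Split text into word n-grams (simplistic whitespace tokenization)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def near_copy_score(generated, training_doc, n=8):
    """Fraction of the generated text's n-grams that also appear
    verbatim in a training document; 1.0 means every n-gram matches."""
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    return len(gen & ngrams(training_doc, n)) / len(gen)

# Toy example: the generated text repeats most of a training document.
doc = "the quick brown fox jumps over the lazy dog and runs away"
out = "the quick brown fox jumps over the lazy dog today"
score = near_copy_score(out, doc, n=5)  # high score flags a near copy
```

Without access to the training data itself, no score like this can be computed, which is exactly the practical gap the proposal above is pointing at.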

The OSD explicitly includes language about what an open source license “may” restrict or require, alongside language about what it “must” restrict and allow. Those clearly defined options within the text of the OSD are what make the unified diversity of the open source community possible. If we can do the same with the Open Source AI Definition, and clearly articulate the principles of open training data while also allowing for diversity, it will better support the long-term success of Open Source AI, as well as its developers, deployers, and users. To be honest, the 0.0.9 version is very close to doing this already; you could even say that a diversity of approaches to training data is implied within the definition, and made more obvious in the FAQ and checklist. But this diversity isn’t stated as clearly in the Open Source AI Definition as it is in the OSD, and if we’ve learned anything from decades of poring over the OSD as software evolved for new use cases in new problem domains, and new licenses emerged to address new challenges, it’s that an ounce of clarity in the definition can save us a world of pain over the coming decades.

I propose two small text changes to the 0.0.9 draft of the Open Source AI Definition:

  1. Update the first bullet point under the section heading “Preferred form to make modifications to machine-learning systems” to explicitly make room (within a unified Open Source AI Definition) for the diversity of approaches to training data that already exist in the open source community, and clearly articulate the principles that make an Open Source AI system “open” when the training data is not (it can be retrained from scratch):
  • Data: Sufficiently detailed information about the source data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available under licenses that comply with the Open Source Definition. In addition, the system may require that all source data be made available under licenses that comply with the Open Source Definition. If the system permits proprietary source data, it must grant the right and provide the means to modify the system to use only source data that complies with the Open Source Definition.
  2. Update the first bullet point under the section heading “Open Source models and Open Source weights” to more accurately capture the role of data in an AI model:
  • An AI model is the output of an algorithm (source code) applied to a training data set (source data). It consists of the model architecture, model parameters (including weights) and inference code for running the model.

I love the spirit of this, but would simply add that no OSI-approved licenses actually handle “data.” This is why, in my opinion, it’s critical for the OSI to work with Creative Commons, which actually maintains the ecosystem of “data licenses” that would apply under your proposed revision.

Agreed, and if you look at Checklist v. 0.0.9, you’ll see that working with Creative Commons or others who define open data/open content licenses is already necessary even without this proposed revision.

Thanks for your comments @arandal and @mjbommar. Indeed, working with organizations such as Creative Commons that maintain the ecosystem of “data licenses” is critical. Some of the organizations we are constantly in touch with include MLCommons, the Open Future Foundation and the Data and Trust Alliance. The latter has recently published the Data Provenance Standards, which we highly recommend.

I mean something stronger: the OSI cannot realistically finalize any OSAID standard without explicit coordination with Creative Commons, since models, and the four freedoms as applied to them, rely on non-software works in AI.

IMO, the OSI should also explore AI preference signalling, ideally sharing standards with CC and SPDX/LF on this option, and the OSAID standard should explicitly contemplate preference signals as part of its roadmap.

@arandal thank you for making the point that open source software, and open source AI, need to be considered in the broader context of other open movements. I fully agree.
On this point, and also in reply to @mjbommar: great point about working with CC (disclosure: I’m on the board of CC). But I think the cooperation needs to go further, and not limit itself to licensing standards.
One key mechanism to pay attention to is the Open Definition. It has been stewarded by the Open Knowledge Foundation, but actually has a community-driven governance model. The Open Definition is, unfortunately, currently dormant. It’s crucial, as it does for open content what the OSD does for open source code.
Going with this analogy, I think that considering the Open Definition is not enough, because the issue goes beyond defining licensing standards. Just as there was a need for the OSI to work on a definition of Open Source AI, there’s a need to set a standard for how various resources (considered data from the perspective of AI development) are governed, made available, and used. This standardization work has not yet been undertaken, and at best would be a shared effort by the various orgs mentioned in this thread.

I also want to comment on this idea. It sounds like an “upstream copyleft” clause that would apply in cases where a given system is rebuilt using different data. I think it’s conceptually interesting, but such a mechanism could not rely on traditional copyleft frameworks. Open Future has commissioned a study that shows the general challenges of copyleft in the space of AI development: The Impact of Share Alike/CopyLeft Licensing on Generative AI – Open Future
(I hope I am understanding your idea correctly.)

Nod, you are absolutely right in recognizing the similarity to copyleft. But this isn’t literally a copyleft software license or a CC Share Alike license; it’s taking a step back to apply the fundamental principles of a “strong” form of software freedom in a different technical domain. As technology evolves, the legal tools we use evolve too. The original Open Source Definition never said “copyleft”; it defined the principles in a way that left room for specific legal tools to evolve. For historical perspective, think about how the AGPL was introduced years after the OSD was codified, to address the fact that earlier copyleft licenses hadn’t considered SaaS at all.

The post you linked is talking about the limitations of existing copyleft/Share Alike licenses in the context of AI training, and mentions two possible approaches to resolve the limitations. Other groups are already discussing other possible approaches.

Thanks, @arandal. Do you have any links that describe these possible approaches? I’d appreciate any leads.

@Alek_Tarkowski It’s very much a work in progress, and there are far more conversations than publications, which seems about right for where the tech, the industry, and the community are now. Some discussions I’ve personally participated in where the topic came up were Software Freedom Conservancy’s Committee On AI-Assisted Programming and Copyleft, OpenInfra’s Policy for AI Generated Content, and Creative Commons’ NYC Workshop on Preference Signals. Many similar conversations are happening across many different organizations and projects, and it’ll take a while for them to converge.