Welcome diverse approaches to training data within a unified Open Source AI Definition

The Open Source Definition is a unified umbrella that intentionally makes room for diversity within the whole: it includes permissive and copyleft licenses, licenses that permit linking with proprietary code and those that don’t, licenses that require a name/version change for proprietary versions and those that don’t, and so on. This unified diversity is a crucial part of the success of the OSD.

Open Source software also exists within the context of other open movements: open hardware and open silicon lower down in the stack, and open data and open content higher up the stack. While software is the primary mission of the OSI, it’s important for us to be mindful of these other open movements, be careful not to undermine their success, and acknowledge the compounding benefits of openness at multiple levels of the stack, such as combining open source software with open data.

With that in mind, we need to be clear that an AI model is not purely software: it is the result of applying an algorithm (source code) to a specific training data set (source data), which makes it a compiled form of both source code and source data. Some Open Source AI developers won’t mind if their AI source code is “linked” with proprietary source data, similar to the way that some Open Source software developers don’t mind if their source code is linked with or embedded within proprietary software. But, in keeping with the philosophy of the OSD, we need to embrace the fact that some developers do care, and do not want their Open Source AI code to be linked with proprietary source data, just as some Open Source software developers do not want their source code to be linked to or embedded within proprietary software.

There are clear practical benefits to combining Open Source AI with open training data:

  • It grants a more comprehensive right to study the AI system and understand how it works.
  • It grants a more comprehensive right to modify the AI model, by modifying both the source data and source code to train a new model.
  • It grants a more comprehensive right to share more of the Open Source AI system.
  • It enables additional beneficial features when using the system, such as checking whether generative output includes near copies of the training data (a perfectly reasonable and statistically likely output from generative AI), and whether the license, terms, or preference signals of the training data are compatible with the intended use of the generated output.

The OSD explicitly includes some language about what an open source license “may” restrict or require, together with language about what it “must” restrict and allow. Those clearly defined options within the text of the OSD make the unified diversity of the open source community possible. If we can do the same with the Open Source AI Definition, and clearly articulate the principles of open training data while also allowing for diversity, it will better support the long-term success of Open Source AI, as well as its developers, deployers, and users. TBH, the 0.0.9 version is very close to doing this already; you could even say that a diversity of approaches to training data is implied within the definition, and more obvious in the FAQ and checklist. But this diversity isn’t as clearly stated in the Open Source AI Definition as it is in the OSD, and if we’ve learned anything from decades of poring over the OSD as software evolved for new use cases in new problem domains, and new licenses emerged to address new challenges, we know that an ounce of clarity in the definition can save us a world of pain over the coming decades.

I propose two small text changes to the 0.0.9 draft of the Open Source AI definition:

  1. Update the first bullet point under the section heading “Preferred form to make modifications to machine-learning systems” to explicitly make room (within a unified Open Source AI Definition) for the diversity of approaches to training data that already exist in the open source community, and clearly articulate the principles that make an Open Source AI system “open” when the training data is not (it can be retrained from scratch):
  • Data: Sufficiently detailed information about the source data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available with licenses that comply with the Open Source Definition. In addition, the system may require that all source data must be made available under licenses that comply with the Open Source Definition. If the system permits proprietary source data, it must grant the right and provide the means to modify the system to use only source data that complies with the Open Source Definition.
  2. Update the first bullet point under the section heading “Open Source models and Open Source weights” to more accurately capture the role of data in an AI model:
  • An AI model is the output of an algorithm (source code) applied to a training data set (source data). It consists of the model architecture, model parameters (including weights) and inference code for running the model.
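To make the “compiled form” analogy in these proposals concrete, here is a toy sketch (the names `train` and `source_data` are purely illustrative, not from any real framework): the trained weight plays the role of the binary, and it cannot be reproduced from the training code alone without the source data.

```python
# Toy illustration of "model = algorithm (source code) applied to a
# training data set (source data)". Names here are illustrative only.

def train(data):
    """Fit y = w * x by least squares; the weight w is the 'model parameter'."""
    num = sum(x * y for x, y in data)
    den = sum(x * x for x, _ in data)
    return num / den

# The "source data": releasing train() alone, without this data,
# does not let anyone recreate the weight below.
source_data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
weight = train(source_data)  # the "compiled" artifact, analogous to a binary
```

The point of the sketch: publishing `train()` without `source_data` is like publishing a compiler without the program’s source code; you can run the binary (the weight), but you cannot rebuild or meaningfully modify it.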

I love the spirit of this, but would simply add that no OSI-approved licenses actually handle “data”; this is why, in my opinion, it’s critical for the OSI to work with Creative Commons, which actually maintains the ecosystem of “data licenses” that would apply under your proposed revision.

Agreed, and if you look at Checklist v. 0.0.9, you’ll see that working with Creative Commons or others who define open data/open content licenses is already necessary even without this proposed revision.

Thanks for your comments @arandal and @mjbommar. Indeed, working with organizations such as Creative Commons that maintain the ecosystem of “data licenses” is critical. Some of the organizations that we are constantly in touch with include MLCommons, the Open Future Foundation and the Data and Trust Alliance. The latter has recently published the Data Provenance Standards, which we highly recommend.

I mean something stronger: that the OSI cannot realistically finalize any OSAID standard without explicit coordination with Creative Commons, since AI models, and the four freedoms as applied to them, rely on non-software works.

IMO, the OSI should also explore AI preference signalling and ideally share standards with CC and SPDX/LF re: this option, and the OSAID standard should explicitly contemplate preference signals as part of its roadmap.


@arandal thank you for making the point that open source software, and open source AI, need to be considered in the broader context of other open movements. I fully agree.
On this, and also replying to @mjbommar, great point about working with CC (disclosure: I’m on the board of CC). But I think the cooperation needs to go further, and not limit itself to licensing standards.
One key mechanism to pay attention to is the Open Definition. It has been stewarded by the Open Knowledge Foundation, but actually has a community-driven governance model. The Open Definition is, unfortunately, currently dormant. It’s crucial, though, as it does for content what the OSI definition does for open source code.
Following this analogy, I think that considering the Open Definition is not enough, because the issue goes beyond defining licensing standards. Just as there was a need for the OSI to work on a definition of Open Source AI, there is a need to set a standard for how various resources (considered data from the perspective of AI development) are governed, made available, and used. This standardization work has not yet been undertaken, and at best would be a shared effort by the various orgs mentioned in this thread.


I also want to comment on this idea. It sounds like an “upstream copyleft” clause that would apply in cases where a given system is rebuilt using different data. I think it’s conceptually interesting, but such a mechanism could not rely on traditional copyleft frameworks. Open Future has commissioned a study that shows general challenges with copyleft in the space of AI development: The Impact of Share Alike/CopyLeft Licensing on Generative AI – Open Future
(I hope that I am understanding correctly your idea)

Nod, you are absolutely right in recognizing the similarity to copyleft. But, this isn’t literally a copyleft software license or CC Share Alike license, it’s taking a step back to apply the fundamental principles of a “strong” form of software freedom in a different technical domain. As technology evolves, the legal tools we use evolve too. The original Open Source Definition never said “copyleft”, it defined the principles in a way that left room for specific legal tools to evolve. For historical perspective, think about how the AGPL was introduced years after the OSD was codified, to address the fact that earlier copyleft licenses hadn’t considered SaaS at all.

The post you linked is talking about the limitations of existing copyleft/Share Alike licenses in the context of AI training, and mentions two possible approaches to resolve the limitations. Other groups are already discussing other possible approaches.

Thanks, @arandal. Do you have any links that describe these possible approaches? I’d appreciate any leads.

@Alek_Tarkowski It’s very much a work in progress, and there are far more conversations than publications, which seems about right for where the tech, the industry, and the community are now. Some discussions I’ve personally participated in where the topic came up were Software Freedom Conservancy’s Committee On AI-Assisted Programming and Copyleft, OpenInfra’s Policy for AI Generated Content, and Creative Commons’ NYC Workshop on Preference Signals. Many similar conversations are happening across many different organizations and projects, and it’ll take a while for them to converge.

Thanks @arandal for taking the time to recommend edits to the definition.

Absolutely! We’ve been and we still are in constant touch with the leadership of CC and OKF during this whole process to make sure that none of the other “opens” would be negatively affected.

If I understand correctly, your suggested edits in #1 can be summarized as:

  • Rename data as “source data”
  • Add language that explicitly allows the upstream developer of Open Source AI to require that downstream modifications are only made with “open data” (allow the persistence of requirements, as in copyleft/share-alike licenses)
  • Add language stating that, if the system is trained on unshareable non-public training data, downstream developers are allowed to use open data to fine-tune that model

In update #2 you seem to be arguing that training data is to trained model weights as software source code is to binary code.

Did I understand your comments correctly?

I agree with this.

In the OSAID checklist, we specify the legal condition for datasets as “OSD-compliant license,” and I believe many participants here interpret this as implicitly including CC 4.0 and CC0. This is a reasonable interpretation, since many datasets are licensed under Creative Commons. However, OSI has not yet made a definitive assessment of Creative Commons licenses; CC0 was reviewed in 2012, but no decision was made to declare it compliant with the OSD. I recall that there were concerns about its weakened defenses against patent claims.

Until now, OSI has only evaluated software licenses, but moving forward with OSAID, we will likely need to learn much from Creative Commons.

So, personally, I am curious about Section 2.b.1 of CC BY 4.0, quoted below. I would like to hear opinions from organizations that actually operate under this license: for example, whether there are cases where “to the extent possible” does not apply, or whether this clause works effectively in most jurisdictions.

Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity, privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow You to exercise the Licensed Rights, but not otherwise.

Yes, I recall that as well. Ironic to me, since the patent defense issue a la Apache 2 is exactly the Achilles’ heel that the OSI definition is fighting to keep re: training data and infringement.

While I’d support the first two change suggestions, and I totally agree that training data is to trained model weights as software source code is to a binary executable, I can’t see how a “system trained on unshareable non-public training data” could match an “Open Source AI” definition.

Such a system would not provide the freedom to study the system, and would severely limit the freedom to modify the model.

Does Open Source grant the freedom to study and modify, or just the freedom to fine-tune?
