Overarching concerns with Draft v.0.0.8 and suggested modifications

juliaferraioli · May 16, 2024, 4:00pm

Hi everyone,

After lengthy discussions, we have serious concerns about the feasibility of the draft in its current form based on our experiences as open source practitioners, machine learning researchers, and recovering compliance people.

We outline them below, and hope to provide a cohesive place to discuss them here in this post. The concerns are not linearly separable, which is why we elected to put them in one single post as opposed to breaking them up into smaller topics.

Overarching concerns

Need for data

First and most importantly, for an Open Source AI Definition to achieve its goals of modifiability, an AI system must include the data used to train the system. We are aware of the challenges that this poses for the definition, but the very Model Openness Framework the current definition references states that full transparency and reproducibility requires the release of all datasets used to train, validate, test, and benchmark. For AI systems, data is the equivalent of source code, and we explicitly require that source code must be obtainable for software to qualify as open source. The current definition marks these datasets as “optional”.

Where inclusion of datasets poses a privacy or legal risk, we suggest the use of equivalent synthetic data to meet this requirement, where the synthetic data achieves comparable results when training the model.

The required components in the Data Information section are not sufficient for someone to modify an AI system as defined. (Note: modification means changing the system before a model is trained, and therefore is more in-depth than fine-tuning, transfer learning, or similar techniques.) Inclusion of datasets is listed as optional, which means that the section might as well be elided. In fact, there is no requirement that the data used to train an Open Source AI system be licensed under an open license at all, unless the maintainer plans to publish them.

In this, the OSAID fails to meet the necessary high bar to ensure a practical and inclusive standard for Open Source AI. Practically, the OSAID is worded this way so that AI systems can be considered “Open Source AI” without having to publish the dependent data.

Furthermore, this introduces a loophole which we anticipate will undermine the very nature of what it means to be open source.

Composition of components

Second, we are concerned that the composition of component licensing will introduce uncertainty and the possibility of a top-level license that adds restrictions preventing the user from fully exercising the four freedoms. Additionally, there is a concern that this top-level license could impose restrictions on the output generated from AI systems. These points, combined with the complexity of the criteria, necessitate the establishment of a process to review and mark AI systems for compliance with the Open Source AI Definition. This work is nontrivial and would require staffing beyond the currently established framework of license reviews.

Without a plan for certification, we anticipate more of the “open-washing” that the field is already experiencing and a dilution of the OSD. This is not in anyone’s best interests and we want to flag this as a risk.

Ambiguity of language

Third, the draft uses a number of words open to wide interpretation, such as “sufficiently”, “skilled”, “substantially equivalent”, “same or similar” without providing concrete guidance for what these terms mean in practice. Additionally, the definitions of “OSD-compliant license” and “OSD-conformant terms” are still under active discussion, the conclusion of which we consider a precondition for a proper evaluation of the draft.

Proposed modifications

We have concrete suggestions for addressing these concerns:

Require release of dependent datasets

Require that the dependent datasets for an AI model be released under an open data license. If any of the dependent datasets cannot be distributed for legal or regulatory reasons, a high quality equivalent synthetic dataset can be distributed instead.

To apply this change, replace the “Data information” section with the following:

Data: The data used to train and/or run the system. This includes initial training data as well as any data used to refine, fine-tune, retrain, or otherwise modify the system. If a dataset cannot be distributed for legal or regulatory reasons, a high quality equivalent synthetic dataset may be distributed in its place to meet this requirement. Any substituted synthetic datasets must be clearly marked as synthetic in corresponding documentation for the system.

Additionally, in the checklist section:

Move the “Data card” to the required section
Adjust the “Data Information” section of the “Optional Components” table to move the “Training data sets”, “Testing data sets”, “Validation data sets”, and “Benchmarking data sets” into the “Required components” table

Establish a certification process

Establish a certification process (and certification mark) for AI systems looking to be called Open Source AI. This ensures that a defensible position can be reached when (not if) vendors continue to refer to their offerings as “Open Source AI”. This process complements the OSAID, but would need to launch at the same time as version 1.0.

Clarify language

Add explicit sentences to the “Data”, “Code”, and “Model” definitions, under “Preferred form to make modifications to machine-learning systems”, to make it unambiguous that these must be under OSI-approved or OSD-compliant licensing.

Data: All data must be under OSD-compliant licensing.
Code: All code must be under OSI-approved licensing, with the exception of code used to perform inference for benchmark tests and/or evaluation code. It is recommended that these items be under OSI-approved licenses, but they are not required.
Model: The model parameters must be available under OSD-conformant terms. The model architecture must be available under OSI-approved licensing.
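As a rough illustration only, the three rules above could be expressed as a mechanical check. This is a minimal sketch, not part of the proposal: the `Component` type, the component kinds, and the license identifiers `EXAMPLE-OPEN-DATA-1.0` and `EXAMPLE-OPEN-WEIGHTS-1.0` are hypothetical, and the allow-lists are supplied by the caller because what counts as “OSD-compliant” or “OSD-conformant” is still under discussion.

```python
from dataclasses import dataclass
from typing import Optional

# Which allow-list each kind of component is checked against, following
# the Data / Code / Model sentences proposed above.
REQUIRED_LIST = {
    "data": "OSD-compliant",
    "code": "OSI-approved",
    "parameters": "OSD-conformant",
    "architecture": "OSI-approved",
}

# Benchmark inference and evaluation code: open licensing is recommended
# but not required, so these components are skipped.
EXEMPT_KINDS = {"eval_code"}


@dataclass(frozen=True)
class Component:
    name: str
    kind: str                  # one of REQUIRED_LIST or EXEMPT_KINDS
    license_id: Optional[str]  # None means no license was declared


def licensing_problems(components, allow_lists):
    """Return one human-readable message per component that fails its rule."""
    problems = []
    for c in components:
        if c.kind in EXEMPT_KINDS:
            continue
        required = REQUIRED_LIST.get(c.kind)
        if required is None:
            problems.append(f"{c.name}: unknown component kind '{c.kind}'")
        elif c.license_id not in allow_lists[required]:
            problems.append(f"{c.name}: license {c.license_id} is not {required}")
    return problems


# Hypothetical allow-lists; only MIT and Apache-2.0 are real license IDs.
allow_lists = {
    "OSI-approved": {"MIT", "Apache-2.0"},
    "OSD-compliant": {"EXAMPLE-OPEN-DATA-1.0"},
    "OSD-conformant": {"EXAMPLE-OPEN-WEIGHTS-1.0"},
}

system = [
    Component("training set", "data", None),
    Component("training code", "code", "Apache-2.0"),
    Component("benchmark harness", "eval_code", "Proprietary"),
    Component("weights", "parameters", "EXAMPLE-OPEN-WEIGHTS-1.0"),
    Component("architecture", "architecture", "MIT"),
]

problems = licensing_problems(system, allow_lists)
for message in problems:
    print(message)
```

In this made-up system only the unlicensed training set is flagged; the proprietary benchmark harness passes because evaluation code is exempt under the proposed wording.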

Prevent restrictions on outputs of systems

Add prohibitions about restrictions as applied to the outputs of Open Source AI systems to ensure that data generated by the system is not bound by terms that restrict the use of that data for any purpose.

We suggest adding the following text to the OSAID:

An Open Source AI must not impose restrictions on the use, modification, or distribution of output files or data generated by the AI system.

Eliminate optional components

Remove all remaining optional components from the OSAID. To accomplish this, we propose amending the OSAID and deleting everything from “The following components are not required …” downward. This could potentially move to an annotated version (similar to the Annotated OSD).

Address combinatorial ambiguity

We suggest adding:

An Open Source AI can be represented as a distribution of multiple components. If a license is applied across this combination, it must be an OSD-approved license.

after the table of default required components.

We look forward to a discussion of these suggestions.

Sincerely,
julia ferraioli and Tom (Spot) Callaway
AWS Open Source

Mer · May 17, 2024, 12:08am

@juliaferraioli and Spot, thank you for your serious and detailed review of v. 0.0.8 and for your proposals of alternative options. I look forward to reading others’ feedback on these ideas.

shujisado · May 17, 2024, 1:08am

The establishment of a Certification Process would be a separate discussion.

In OSAID, OSI-approved licenses also play a crucial role in practice. Historically, each OSI-approved license has been approved through a very simple process. Currently, OSI does not review whether the Software is open source, and once a license is approved, its certification is never revoked. OSI has always only judged the open source compliance of licenses. OSI has played a very small and simple role within the open source community, and that may have been the key to its success.

If we were to establish an open source AI certification process for AI systems, it would greatly expand the tasks OSI has been performing so far. While I agree that many AI vendors would require a certification process, doing this within OSI would require a very significant decision and resources. Speaking from the standpoint of my current company, LINE-Yahoo!, I would be in favor of it, but as an individual, I have concerns that it might deviate from the tasks OSI should be undertaking.

shujisado · May 17, 2024, 12:32pm

By the way, when you say “an open data license,” are you referring to a license approved by the Open Knowledge Foundation?
I don’t think that’s the case, but I just wanted to confirm.

jberkus · May 17, 2024, 10:56pm

Julia,

I definitely agree that the data must be covered in some way. That’s why we’re doing this whole Open Source AI definition exercise in the first place; if just the code were enough, we could use the OSD.

Aspie96 · May 18, 2024, 4:45am

I think that is a very liberal use of the word “equivalent”.

Without source code, you don’t know what operations a computer program is performing and you do not have it in the human-readable form in which the author has it. Source code and object code are also considered two representations of the same asset, legally and conceptually.

When it comes to trained models, the dataset is not necessary to know what operations the model is performing (that is determined by the architecture and by the frameworks being used), except at a specific conceptual level which requires massive efforts to understand (with relatively poor results) and which data doesn’t reveal anyways. The dataset and the model are also not two different representations of the same thing and conversion in either direction can be far from trivial (while, with software, conversion can happen in at least one direction usually with little effort). While two different code-bases, developed separately, represent different programs, even if compiled, two different datasets sampled from essentially the same distribution will lead to substantially similar models, if the same architecture is used.
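A toy illustration of that last claim, under heavy assumptions: the “architecture” here is a one-variable least-squares line, the data distribution (y = 2x + 1 plus Gaussian noise) is invented for the example, and the function names are hypothetical. It shows only that two independent samples from one distribution give nearly the same fitted parameters in this simple setting; it says nothing about how far the observation carries over to large models.

```python
import random


def sample_dataset(rng, n):
    """Draw n (x, y) pairs from an assumed distribution: y = 2x + 1 + noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)
        data.append((x, y))
    return data


def fit_line(data):
    """Ordinary least squares for a single input; returns (slope, intercept)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Two datasets collected independently (different seeds), same distribution.
model_a = fit_line(sample_dataset(random.Random(1), 2000))
model_b = fit_line(sample_dataset(random.Random(2), 2000))

slope_gap = abs(model_a[0] - model_b[0])
intercept_gap = abs(model_a[1] - model_b[1])
print(f"model A: slope={model_a[0]:.3f}, intercept={model_a[1]:.3f}")
print(f"model B: slope={model_b[0]:.3f}, intercept={model_b[1]:.3f}")
print(f"gaps: slope={slope_gap:.4f}, intercept={intercept_gap:.4f}")
```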

zack · May 18, 2024, 7:55am

Hello Julia, Spot, I am in full agreement with the position that datasets (training included) are part of the “preferred form of modification” of AI systems and, as such, needed to exercise the freedoms to study and modify them. I supported and still support making their release required in OSAID, under an open data license. (That too has been discussed in the past, but not retained up to v0.0.8, unfortunately.)

Your take on requiring the release of a “high quality equivalent synthetic dataset”, when the original dataset cannot be released, is novel and I quite like it. It would be great if that can be the compromise we reach to include the dataset requirement back in OSAID.

Note however that it is plagued by the same problem that you (correctly) criticize elsewhere, of leaving undefined what “equivalent” means, opening the door to potential loopholes. I don’t think that problem is fixable for a definition aiming to cover “AI systems” in general, and it doesn’t make things worse on this front for OSAID, so I don’t consider it a blocker for adopting the proposal.

Cheers

Danish_Contractor · May 18, 2024, 2:14pm

I think relaxing the dataset assumption to just having it be available at zero charge, regardless of whether it’s OSAID-compliant, would make it much easier to comply and remove all the vagueness from the draft.

juliaferraioli · May 19, 2024, 5:43pm

The establishment of a Certification Process would be a separate discussion.

Absolutely, and please see the other thread started here

I echo your concerns and that’s why we flagged them here. I do not see a way around the necessity of a certification process with the definition in its current form; it is too complex and has too many component parts with subjective requirements.

when you say “an open data license,” are you referring to a license approved by the Open Knowledge Foundation?

Not necessarily. What I was getting at (imprecisely, because the language around all of this has become so muddled) is data licensed in such a way that they may be used similarly to how open source software is used, with the criteria.

juliaferraioli · May 19, 2024, 5:52pm

It’s not very liberal at all. Without the data, you don’t know what went into training the system. You could not recreate it, retrain it, or otherwise “recompile it”. A trained model, I posit, cannot be considered open source in and of itself without the data, processing code, and training code. Even the processing and training code wouldn’t be enough. To simplify things, I find it useful to compare a trained model to a software binary (though I know Spot has issues with that comparison). We don’t call software binaries open source if the code behind them is not available and licensed as open source.

The initial training and evaluation datasets are hugely critical in understanding the underlying system, especially when making decisions about whether to use it as the foundation for your own modifications.

The question I keep coming back to is “could you recreate the model without the data?” Answers to that tend to be a flavor of “yes, well, you can acquire your own data”, but that undercuts the promise of open source, and we won’t see nearly the innovation and empowerment that we have seen with open source software if we make data optional.

juliaferraioli · May 19, 2024, 5:55pm

Note however that it is plagued by the same problem that you (correctly) criticize elsewhere, of leaving undefined what “equivalent” means, opening the door to potential loopholes.

I agree that it is kicking the can down the road a bit. I would want to see a justification for why an AI system is using the option of synthetic data, which would give adopters the ability to judge for themselves whether they buy the justification or not. The concept of synthetic data is an attempt to find a middle ground for data that absolutely cannot be made public for reasons of safety/privacy/etc…