Hi everyone,
After lengthy discussions, we have serious concerns about the feasibility of the draft in its current form based on our experiences as open source practitioners, machine learning researchers, and recovering compliance people.
We outline them below, and hope to provide a cohesive place to discuss them here in this post. The concerns are not linearly separable, which is why we elected to put them in one single post as opposed to breaking them up into smaller topics.
Overarching concerns
Need for data
First and most importantly, for an Open Source AI Definition to achieve its goals of modifiability, an AI system must include the data used to train the system. We are aware of the challenges that this poses for the definition, but the very Model Openness Framework the current definition references states that full transparency and reproducibility require the release of all datasets used to train, validate, test, and benchmark. For AI systems, data is the equivalent of source code, and we explicitly require that source code must be obtainable for software to qualify as open source. The current definition marks these datasets as “optional”.
Where inclusion of datasets poses a privacy or legal risk, we suggest the use of equivalent synthetic data to meet this requirement, where the synthetic data achieves comparable results when training the model.
The required components in the Data Information section are not sufficient for someone to modify an AI system as defined. (Note: modification means changing the system before a model is trained, and therefore is more in-depth than fine-tuning, transfer learning, or similar techniques.) Inclusion of datasets is listed as optional, which means that the section might as well be elided. In fact, there is no requirement that the data used to train an Open Source AI system be licensed under an open license at all, unless the maintainer plans to publish it.
In this, the OSAID fails to meet the necessary high bar to ensure a practical and inclusive standard for Open Source AI. Practically, the OSAID is worded this way so that AI systems can be considered “Open Source AI” without having to publish the dependent data.
Furthermore, this introduces a loophole that we anticipate will undermine the very nature of what it means to be open source.
Composition of components
Second, we are concerned that the composition of component licensing will introduce uncertainty and the possibility of a top-level license that adds restrictions preventing the user from fully exercising the four freedoms. Additionally, there is a concern that this top-level license could impose restrictions on the output generated from AI systems. These points, combined with the complexity of the criteria, necessitate the establishment of a process to review and mark AI systems for compliance with the Open Source AI Definition. This work is nontrivial and would require staffing beyond the currently established framework of license reviews.
Without a plan for certification, we anticipate more of the “open-washing” that the field is already experiencing and a dilution of the OSD. This is not in anyone’s best interests and we want to flag this as a risk.
Ambiguity of language
Third, the draft uses a number of terms open to wide interpretation, such as “sufficiently”, “skilled”, “substantially equivalent”, and “same or similar”, without providing concrete guidance for what these terms mean in practice. Additionally, the definitions of “OSD-compliant license” and “OSD-conformant terms” are still under active discussion, the conclusion of which we consider a precondition for a proper evaluation of the draft.
Proposed modifications
We have concrete suggestions for addressing these concerns:
Require release of dependent datasets
Require that the dependent datasets for an AI model be released under an open data license. If any of the dependent datasets cannot be distributed for legal or regulatory reasons, a high-quality equivalent synthetic dataset can be distributed instead.
To apply this change, replace the “Data information” section with the following:
Data: The data used to train and/or run the system. This includes initial training data as well as any data used to refine, fine-tune, retrain, or otherwise modify the system. If a dataset cannot be distributed for legal or regulatory reasons, a high-quality equivalent synthetic dataset may be distributed in its place to meet this requirement. Any substituted synthetic datasets must be clearly marked as synthetic in corresponding documentation for the system.
Additionally, in the checklist section:
- Move the “Data card” to the required section
- Adjust the “Data Information” section of the “Optional Components” table to move the “Training data sets”, “Testing data sets”, “Validation data sets”, and “Benchmarking data sets” into the “Required components” table
Establish a certification process
Establish a certification process (and certification mark) for AI systems looking to be called Open Source AI. This ensures that a defensible position can be reached when (not if) vendors continue to refer to their offerings as “Open Source AI”. This process complements the OSAID, but would need to launch at the same time as version 1.0.
Clarify language
Add explicit sentences to the “Data”, “Code”, and “Model” definitions, under “Preferred form to make modifications to machine-learning systems”, to make it unambiguous that these must be under OSI-approved or OSD-compliant licensing.
- Data: All data must be under OSD-compliant licensing.
- Code: All code must be under OSI-approved licensing, with the exception of code used to perform inference for benchmark tests and/or evaluation code. It is recommended that these items be under OSI-approved licenses, but they are not required.
- Model: The model parameters must be available under OSD-conformant terms. The model architecture must be available under OSI-approved licensing.
Prevent restrictions on outputs of systems
Prohibit restrictions on the outputs of Open Source AI systems, to ensure that data generated by the system is not bound by terms that restrict the use of that data for any purpose.
We suggest adding the following text to the OSAID:
An Open Source AI must not impose restrictions on the use, modification, or distribution of output files or data generated by the AI system.
Eliminate optional components
Remove all remaining optional components from the OSAID. To accomplish this, we propose amending the OSAID and deleting everything from “The following components are not required …” downward. The deleted material could potentially move to an annotated version (similar to the Annotated OSD).
Address combinatorial ambiguity
We suggest adding:
An Open Source AI can be represented as a distribution of multiple components. If a license is applied across this combination, it must be an OSD-approved license.
after the table of default required components.
We look forward to a discussion of these suggestions.
Sincerely,
julia ferraioli and Tom (Spot) Callaway
AWS Open Source