Checklist is available for comments

The discussion on “Training data access” is important, but since the OSI has indicated the above focus areas until October 28, let’s gather opinions within this scope.

While recruiting endorsers and preparing for All Things Open may not seem like major topics in this forum at the moment, the Checklist and FAQ are clearly crucial. The Checklist should be a vital document paired with the definition itself, but there hasn’t been much commentary on it either on HackMD or in this forum. At the very least, we should offer suggestions on how it should be modified to align with the current RC1.

For reference, I’ve extracted the Data and Model sections from the current Checklist below:

Required components: Legal frameworks
Data
At least one of these data components is required, in decreasing order of importance
- Datasets: Available under OSD-compliant license
- Research paper: Available under OSD-compliant license
- Technical report: Available under OSD-compliant license
- Data card: Available under OSD-compliant license
Code
(omit)
Model
All of these components are required
- Model architecture: Available under OSI-approved license
- Model parameters: Available under OSD-conformant terms
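
To make the two rules in this extract concrete ("at least one" for Data, "all" for Model), here is a minimal Python sketch. It is only my reading of the excerpt, not anything published by the OSI: the names `DATA_COMPONENTS`, `MODEL_COMPONENTS` and `check_release` are hypothetical, the Code section is left out because I omitted it above, and the sketch assumes someone has already decided whether each component is released under the required terms.

```python
# Hypothetical encoding of the Data and Model rows quoted above.
# The Code section is intentionally not modeled (it was omitted from the extract).
DATA_COMPONENTS = ["Datasets", "Research paper", "Technical report", "Data card"]
MODEL_COMPONENTS = ["Model architecture", "Model parameters"]


def check_release(available):
    """Check a release against the quoted Data and Model rules.

    `available` is the set of component names that are released under the
    terms the checklist asks for (assumed to be determined elsewhere).
    """
    # Data: at least one of the listed components is required.
    data_ok = any(name in available for name in DATA_COMPONENTS)
    # Model: all of the listed components are required.
    missing_model = [name for name in MODEL_COMPONENTS if name not in available]
    return {
        "data_ok": data_ok,
        "model_ok": not missing_model,
        "missing_model": missing_model,
    }


if __name__ == "__main__":
    # A release with only a data card still satisfies the Data rule as quoted.
    print(check_release({"Data card", "Model architecture", "Model parameters"}))
```

Note that under the rule as quoted, a release with a data card but no datasets passes the Data check.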

I’ve always found the checklist a little confusing and lacking in detail, until I read the MOF paper, which I think presents this information in a better way:

6 Model Openness Framework Acceptable Licenses [p.12]

Table 2 provides an overview of acceptable licenses for each component. The table categorizes each component into one of three domains: Data, Model, or both. Additionally, the content type of each component is classified as data, code, or documentation. The table specifies standard open licenses that should be used for releasing each component, while allowing some flexibility for equivalent licenses. By providing a comprehensive scope, the MOF encourages opening the entire pipeline that produces, evaluates, and applies a model. This approach offers multiple perspectives into the model’s inner workings, promoting transparency and reproducibility in open model development

| Component | Type | Recommended Open License |
|---|---|---|
| Datasets | Data | Preferred: CDLA-Permissive-2.0, CC-BY-4.0; Acceptable: Any including unlicensed |
| Data Preprocessing Code | Code | Acceptable: OSI-approved |
| Model Architecture | Code | Acceptable: OSI-approved |
| Model Parameters | Data | Preferred: CDLA-Permissive-2.0; Acceptable: OSI-Approved, Permissive Open Data Licenses |
| Model Metadata | Data | Preferred: CDLA-Permissive-2.0; Acceptable: CC-BY-4.0, Permissive Open Data Licenses |
| Training Code | Code | Acceptable: OSI-approved |
| Inference Code | Code | Acceptable: OSI-approved |
| Evaluation Code | Code | Acceptable: OSI-approved |
| Evaluation Data | Data | Preferred: CDLA-Permissive-2.0; Acceptable: CC-BY-4.0, Permissive Open Data Licenses |
| Evaluation Results | Documentation | Preferred: CC-BY-4.0; Acceptable: Permissive Open Content Licenses |
| Supporting libraries and Tools | Code | Acceptable: OSI-approved |
| Model Card | Documentation | Preferred: CC-BY-4.0; Acceptable: Permissive Open Content Licenses |
| Data Card | Documentation | Preferred: CC-BY-4.0; Acceptable: Permissive Open Content Licenses |
| Technical Report | Documentation | Preferred: CC-BY-4.0; Acceptable: Permissive Open Content Licenses |
| Research Paper | Documentation | Preferred: CC-BY-4.0; Acceptable: Permissive Open Content Licenses |
| Sample Model Outputs | Data or Code | Unlicensed |
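
As a rough illustration of how the "Preferred" column could be looked up, here is a small Python sketch. It copies only the preferred license identifiers from the rows above. The "Acceptable" entries are categories ("OSI-approved", "Permissive Open Data Licenses") rather than specific licenses, so I have not tried to enumerate them. The names `PREFERRED` and `is_preferred` are hypothetical and not part of the MOF paper.

```python
# Preferred licenses per component, transcribed from the rows quoted above.
# Components whose row lists no "Preferred" license (the code components and
# Sample Model Outputs) are simply absent from this mapping.
PREFERRED = {
    "Datasets": {"CDLA-Permissive-2.0", "CC-BY-4.0"},
    "Model Parameters": {"CDLA-Permissive-2.0"},
    "Model Metadata": {"CDLA-Permissive-2.0"},
    "Evaluation Data": {"CDLA-Permissive-2.0"},
    "Evaluation Results": {"CC-BY-4.0"},
    "Model Card": {"CC-BY-4.0"},
    "Data Card": {"CC-BY-4.0"},
    "Technical Report": {"CC-BY-4.0"},
    "Research Paper": {"CC-BY-4.0"},
}


def is_preferred(component, license_id):
    """Return True if license_id is listed as preferred for the component."""
    return license_id in PREFERRED.get(component, set())


if __name__ == "__main__":
    print(is_preferred("Datasets", "CC-BY-4.0"))
    print(is_preferred("Model Parameters", "CC-BY-4.0"))
```

A `False` result here only means "not in the preferred column"; the license may still fall under one of the acceptable categories, which need a human reading.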

The separation has a clear purpose, and the checklist has followed it, but as pointed out in the comments on the draft, the datasets should have been split into three groups like the code (training, testing and evaluation), as they can be licensed differently.

Instead of naming actual licenses (CC, CDLA) or using the rather vague “Permissive Open Content”, the checklist introduces the terms “OSD-conformant” and “OSD-compliant”, which require a clear explanation, as does “OSI-approved license”. If a definition already exists elsewhere, the document MUST link to it AND copy it verbatim; otherwise it MUST BE fully defined here.

The crux of the discussion: all components must be required in a fully open system, and that is what the Open Source AI Definition must be.

Later, allowances can be made to close some components, and up to a certain degree.


Just a note on this bit:

It is abundantly clear that the OSI wants to move on, but I struggle to find a constructive reading of the rush to ship a weasel-wordy version in which the issues that elicit the most serious and evidence-based pushback from the community remain unaddressed.

And so, I don’t see how we can “get more endorsers” of a definition that is still partly up in the air.

I could perhaps support something like the leaner version proposed by @quaid here, which dispenses with the two murkiest categories of data. As pointed out there, making something stricter after a first loose version is never going to happen, so this is a now-or-never moment.
