Checklist is available for comments

shujisado · October 10, 2024, 12:22pm

The discussion on “Training data access” is important, but since the OSI has indicated the above focus areas until October 28, let’s gather opinions within this scope.

While recruiting endorsers and preparing for All Things Open may not seem like major topics in this forum at the moment, the Checklist and FAQ are clearly crucial. The Checklist should be a vital document paired with the definition itself, but there hasn’t been much commentary on it either on HackMD or in this forum. At the very least, we should offer suggestions on how it should be modified to align with the current RC1.

For reference, I’ve extracted the Data and Model sections from the current Checklist below:

Required components	Legal frameworks
Data
At least one of these data components is required, in decreasing order of importance
- Datasets	Available under OSD-compliant license
- Research paper	Available under OSD-compliant license
- Technical report	Available under OSD-compliant license
- Data card	Available under OSD-compliant license
Code
(omit)
Model
All of these components are required
- Model architecture	Available under OSI-approved license
- Model parameters	Available under OSD-conformant terms
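
To make the two rules in this excerpt concrete ("at least one" for Data, "all" for Model), here is a minimal Python sketch. It is not an official OSI tool: the `Release` class and both function names are hypothetical, the component names are copied from the excerpt above, and the omitted Code section is left out. A component counts as "available" only if it is released under terms that satisfy the Legal frameworks column, which the sketch assumes has been judged beforehand.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

# Component names copied from the checklist excerpt; the Data list keeps
# the checklist's "decreasing order of importance".
DATA_COMPONENTS = ["Datasets", "Research paper", "Technical report", "Data card"]
MODEL_COMPONENTS = ["Model architecture", "Model parameters"]


@dataclass
class Release:
    """Hypothetical record of the components a release makes available
    under terms that satisfy the checklist's legal-framework column."""
    available: Set[str] = field(default_factory=set)


def check_release(release: Release) -> Dict[str, bool]:
    # Data: at least one of the listed components is enough.
    data_ok = any(c in release.available for c in DATA_COMPONENTS)
    # Model: every listed component is required.
    model_ok = all(c in release.available for c in MODEL_COMPONENTS)
    return {"data": data_ok, "model": model_ok}


def strongest_data_component(release: Release) -> Optional[str]:
    # The list is ordered by importance, so the first match is the strongest.
    for component in DATA_COMPONENTS:
        if component in release.available:
            return component
    return None
```

Under this reading, a release that ships only a data card plus both model components passes both checks, which shows how low the Data bar sits in the current Checklist.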

gvlx · October 10, 2024, 3:52pm

I’ve always found the checklist a little confusing and lacking in detail, until I read the MOF paper, which I think presents this information more clearly:

6 Model Openness Framework Acceptable Licenses [p.12]

Table 2 provides an overview of acceptable licenses for each component. The table categorizes each component into one of three domains: Data, Model, or both. Additionally, the content type of each component is classified as data, code, or documentation. The table specifies standard open licenses that should be used for releasing each component, while allowing some flexibility for equivalent licenses. By providing a comprehensive scope, the MOF encourages opening the entire pipeline that produces, evaluates, and applies a model. This approach offers multiple perspectives into the model’s inner workings, promoting transparency and reproducibility in open model development.

Component	Type	Recommended Open License
Datasets	Data	Preferred: CDLA-Permissive-2.0, CC-BY-4.0; Acceptable: Any including unlicensed
Data Preprocessing Code	Code	Acceptable: OSI-approved
Model Architecture	Code	Acceptable: OSI-approved
Model Parameters	Data	Preferred: CDLA-Permissive-2.0; Acceptable: OSI-Approved, Permissive Open Data Licenses
Model Metadata	Data	Preferred: CDLA-Permissive-2.0; Acceptable: CC-BY-4.0, Permissive Open Data Licenses
Training Code	Code	Acceptable: OSI-approved
Inference Code	Code	Acceptable: OSI-approved
Evaluation Code	Code	Acceptable: OSI-approved
Evaluation Data	Data	Preferred: CDLA-Permissive-2.0; Acceptable: CC-BY-4.0, Permissive Open Data Licenses
Evaluation Results	Documentation	Preferred: CC-BY-4.0; Acceptable: Permissive Open Content Licenses
Supporting libraries and Tools	Code	Acceptable: OSI-approved
Model Card	Documentation	Preferred: CC-BY-4.0; Acceptable: Permissive Open Content Licenses
Data Card	Documentation	Preferred: CC-BY-4.0; Acceptable: Permissive Open Content Licenses
Technical Report	Documentation	Preferred: CC-BY-4.0; Acceptable: Permissive Open Content Licenses
Research Paper	Documentation	Preferred: CC-BY-4.0; Acceptable: Permissive Open Content Licenses
Sample Model Outputs	Data or Code	Unlicensed

The separation has a clear purpose, and the checklist has followed it, but as pointed out in the comments on the draft, the datasets should have been split into three groups like the code (training, testing and evaluation), as they can be licensed differently.

Instead of the actual licenses (CC, CDLA) and the rather vague “Permissive Open Content”, it was decided to create the terms “OSD-conformant” and “OSD-compliant”, which require a clear explanation, as does “OSI-approved license”. If those definitions already exist elsewhere, the document MUST link there AND copy them verbatim, or they MUST BE fully defined here.

gvlx · October 10, 2024, 4:08pm

The crux of the discussion: all components must be required in a fully open system, and that is what the Open Source AI Definition must be.

Later, allowances can be made to close some components, up to a certain degree.

Mark · October 10, 2024, 6:45pm

Just a note on this bit:

It is abundantly clear that the OSI wants to move on, but I struggle to find constructive readings of the rush to ship a weasel-wordy version in which the issues that elicit the most serious and evidence-based pushback from the community remain unaddressed.

And so, I don’t see how we can “get more endorsers” of a definition that is still partly up in the air.

I could perhaps support something like the leaner version proposed by @quaid here, which dispenses with the two murkiest categories of data. As pointed out there, making something stricter after a first loose version is never going to happen, so this is a now-or-never moment.
