The discussion on “Training data access” is important, but since the OSI has indicated the above focus areas until October 28, let’s gather opinions within this scope.
While recruiting endorsers and preparing for All Things Open may not seem like major topics in this forum at the moment, the Checklist and FAQ are clearly crucial. The Checklist should be a vital document paired with the definition itself, but there hasn’t been much commentary on it either on HackMD or in this forum. At the very least, we should offer suggestions on how it should be modified to align with the current RC1.
For reference, I’ve extracted the Data and Model sections from the current Checklist below:
Required components
Legal frameworks
Data
At least one of these data components is required, in decreasing order of importance
I’ve always found the checklist a little confusing and lacking in detail, until I read the MOF paper, which I think presents this information more clearly:
6 Model Openness Framework Acceptable Licenses [p.12]
Table 2 provides an overview of acceptable licenses for each component. The table categorizes each component into one of three domains: Data, Model, or both. Additionally, the content type of each component is classified as data, code, or documentation. The table specifies standard open licenses that should be used for releasing each component, while allowing some flexibility for equivalent licenses. By providing a comprehensive scope, the MOF encourages opening the entire pipeline that produces, evaluates, and applies a model. This approach offers multiple perspectives into the model’s inner workings, promoting transparency and reproducibility in open model development.
| Component | Type | Recommended Open License |
| --- | --- | --- |
| Datasets | Data | Preferred: CDLA-Permissive-2.0, CC-BY-4.0; Acceptable: Any including unlicensed |
| Data Preprocessing Code | Code | Acceptable: OSI-approved |
| Model Architecture | Code | Acceptable: OSI-approved |
| Model Parameters | Data | Preferred: CDLA-Permissive-2.0; Acceptable: OSI-Approved, Permissive Open Data Licenses |
| Model Metadata | Data | Preferred: CDLA-Permissive-2.0; Acceptable: CC-BY-4.0, Permissive Open Data Licenses |
| Training Code | Code | Acceptable: OSI-approved |
| Inference Code | Code | Acceptable: OSI-approved |
| Evaluation Code | Code | Acceptable: OSI-approved |
| Evaluation Data | Data | Preferred: CDLA-Permissive-2.0; Acceptable: CC-BY-4.0, Permissive Open Data Licenses |
| Evaluation Results | Documentation | Preferred: CC-BY-4.0; Acceptable: Permissive Open Content Licenses |
| Supporting libraries and Tools | Code | Acceptable: OSI-approved |
| Model Card | Documentation | Preferred: CC-BY-4.0; Acceptable: Permissive Open Content Licenses |
| Data Card | Documentation | Preferred: CC-BY-4.0; Acceptable: Permissive Open Content Licenses |
| Technical Report | Documentation | Preferred: CC-BY-4.0; Acceptable: Permissive Open Content Licenses |
| Research Paper | Documentation | Preferred: CC-BY-4.0; Acceptable: Permissive Open Content Licenses |
| Sample Model Outputs | Data or Code | Unlicensed |
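For anyone who wants to compare a release against the table mechanically, here is a minimal Python sketch of how a few rows could be encoded. It is only an illustration, not something taken from the MOF paper or the OSI checklist: the names `LicenseRule`, `MOF_TABLE` and `check` are hypothetical, only four rows are transcribed, and the sketch deliberately does not decide whether a license belongs to a category such as "OSI-approved", since that still needs a human (or an authoritative list) to judge.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LicenseRule:
    """One row of the table (hypothetical structure, for illustration only)."""
    content_type: str            # "Data", "Code", or "Documentation"
    preferred: Tuple[str, ...]   # licenses the table names as "Preferred"
    acceptable: str              # free-text "Acceptable" entry from the table


# A few rows transcribed from the table above (not the full table).
MOF_TABLE = {
    "Datasets": LicenseRule(
        "Data", ("CDLA-Permissive-2.0", "CC-BY-4.0"), "Any including unlicensed"
    ),
    "Model Parameters": LicenseRule(
        "Data", ("CDLA-Permissive-2.0",), "OSI-Approved, Permissive Open Data Licenses"
    ),
    "Training Code": LicenseRule("Code", (), "OSI-approved"),
    "Model Card": LicenseRule(
        "Documentation", ("CC-BY-4.0",), "Permissive Open Content Licenses"
    ),
}


def check(component: str, license_id: str) -> str:
    """Report how a license relates to the table row for a component."""
    rule = MOF_TABLE[component]
    if license_id in rule.preferred:
        return "preferred"
    # Category membership (e.g. "OSI-approved") is not decided here;
    # the caller is pointed at the table's "Acceptable" entry instead.
    return f"review against: {rule.acceptable}"
```

With this sketch, `check("Datasets", "CC-BY-4.0")` returns `"preferred"`, while `check("Training Code", "MIT")` returns `"review against: OSI-approved"`, which makes it visible how much of the table rests on categories that still need a definition.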
The separation has a clear purpose, and the checklist has followed it, but as pointed out in the comments on the draft, the datasets should have been split into three groups like the code (training, testing and evaluation), as they can be licensed differently.
Instead of naming actual licenses (CC, CDLA) or using the rather vague “Permissive Open Content”, it was decided to introduce the terms “OSD-conformant” and “OSD-Compliant”. Both require a clear explanation, as does “OSI-approved license”. If those definitions already exist elsewhere, the document MUST link there AND copy them verbatim; otherwise they MUST BE fully defined here.
It is abundantly clear that the OSI wants to move on, but I struggle to find a constructive reading of the rush to ship a weasel-wordy version in which the issues that elicit the most serious and evidence-based pushback from the community remain unaddressed.
And so, I don’t see how we can “get more endorsers” of a definition that is still partly up in the air.
I could perhaps support something like the leaner version proposed by @quaidhere, which dispenses with the two murkiest categories of data. As pointed out there, making something stricter after a first loose version is never going to happen, so this is a now-or-never moment.