Background
This past winter, four (4) working groups were convened to develop the Open Source AI Definition through the lens of four (4) AI systems with different approaches to openness: Llama 2, Pythia, BLOOM, and OpenCV.
The first task the groups took on was voting on which components should be required for an AI system to be considered open source. The voting report is here. The groups’ recommendations became a requirements checklist in v.0.0.6 of the definition.
Document Review
This Friday, we will release the next version of the definition, v.0.0.7. This iteration will include a refined checklist, based on a new working group activity: document review.
For each required component in the v.0.0.6 checklist, we asked working group members – both affiliated and unaffiliated with the AI system – to identify and review the legal documents for that component. The goal was to see how well the documents align with the components as described in the checklist.
The document reviewers are as follows:
| Llama 2 | Pythia | BLOOM | OpenCV* |
|---|---|---|---|
| Affiliated Reviewers: | | | |
| Jonathan Torres (Meta) | Stella Biderman (EleutherAI) | Danish Contractor (BLOOM Model Governance Workgroup) | (None) |
| Davide Testuggine (Meta) | Aviya Skowron (EleutherAI) | | |
| | Hailey Schoelkopf (EleutherAI) | | |
| Unaffiliated Reviewers: | | | |
| Stefano Zacchiroli (Polytechnic Institute of Paris) | Seo-Young Isabelle Hwang (Samsung) | Jaan Li (University of Tartu, Phare Health) | Rasim Sen (Oasis Software Technology Ltd.) |
| Victor Lu (independent database consultant) | | | |
For this task, reviewers identified the relevant legal documents describing rights to study, use, modify, and share each of the required components. They then reviewed those documents and made a determination as to whether the documents provided those rights, writing their findings on this public spreadsheet. The findings were then shared with all working group members for further comment.
Proposed Checklist Updates
Following this review, we propose that the checklist be updated in the following ways in v.0.0.7. Additions are marked in italics. The current version of the checklist can be viewed here.
| Required Components | Legal Framework |
|---|---|
| Code | |
| Data pre-processing | Available under OSI-compliant license |
| Training, validation and testing | Available under OSI-compliant license |
| Inference | Available under OSI-compliant license |
| Supporting libraries and tools** | Available under OSI-compliant license |
| Model | |
| Model architecture | Available under OSI-compliant license |
| Model parameters (including weights) | _If assumed copyrightable, available under OSI-compliant license_ |
| _Documentation_ | |
| _Training methodologies and techniques_ | _Available under open documentation license_ |
| _Training data scope and characteristics_ | _Available under open documentation license_ |
| _Training data provenance (including how data was obtained and selected)_ | _Available under open documentation license_ |
| _Training data labeling procedures, if used_ | _Available under open documentation license_ |
| _Training data cleaning methodology_ | _Available under open documentation license_ |
The changes above can be summarized as follows:
- Added a legal framework for model parameters (including weights). The framework proposes that, if copyrightable, model parameters can be shared as code.
- Added the five (5) data transparency components from v.0.0.6 to the checklist under the category “Documentation,” along with legal frameworks
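For readers who want to work with the checklist programmatically, the proposed table could be encoded roughly as follows. This is only an illustrative sketch, not part of the definition: the `Requirement` type, the `unmet` function, and the condition flags `weights_copyrightable` and `labeling_used` are hypothetical names, and matching license labels by exact string is a simplification of what was, in practice, a human review of legal documents.

```python
from typing import NamedTuple, Optional

OSI = "OSI-compliant license"
DOC = "open documentation license"


class Requirement(NamedTuple):
    category: str
    component: str
    framework: str
    # Hypothetical flag name; None means the component is always required.
    condition: Optional[str] = None


# Proposed v.0.0.7 checklist, transcribed from the table above.
CHECKLIST = [
    Requirement("Code", "Data pre-processing", OSI),
    Requirement("Code", "Training, validation and testing", OSI),
    Requirement("Code", "Inference", OSI),
    Requirement("Code", "Supporting libraries and tools", OSI),
    Requirement("Model", "Model architecture", OSI),
    Requirement("Model", "Model parameters (including weights)", OSI,
                "weights_copyrightable"),
    Requirement("Documentation", "Training methodologies and techniques", DOC),
    Requirement("Documentation", "Training data scope and characteristics", DOC),
    Requirement("Documentation", "Training data provenance", DOC),
    Requirement("Documentation", "Training data labeling procedures", DOC,
                "labeling_used"),
    Requirement("Documentation", "Training data cleaning methodology", DOC),
]


def unmet(findings, conditions=frozenset()):
    """Return components whose recorded framework does not match the checklist.

    findings:   maps component name -> framework label recorded by a reviewer.
    conditions: set of condition flags that hold for the system under review.
    """
    gaps = []
    for req in CHECKLIST:
        # Skip conditional requirements whose condition does not apply.
        if req.condition is not None and req.condition not in conditions:
            continue
        if findings.get(req.component) != req.framework:
            gaps.append(req.component)
    return gaps


# Example: a system that satisfies everything except the inference code.
example_findings = {req.component: req.framework for req in CHECKLIST}
del example_findings["Inference"]
print(unmet(example_findings, {"weights_copyrightable"}))  # prints ['Inference']
```

Keeping the conditions separate from the findings mirrors the "if assumed copyrightable" and "if used" qualifiers in the table: a conditional component is only checked when its condition is asserted for the system under review.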
We look forward to reading your thoughts and questions in the comments as we prepare v.0.0.7 for release this Friday the 12th. Thank you again for being part of this process.
*OpenCV review is not yet complete as of the publication of this post.
**Includes other libraries or code artifacts that are part of the system, such as tokenizers and hyperparameter search code, if used.