Report of working group document review

Background

This past winter, four (4) working groups were convened to develop the Open Source AI Definition through the lens of four (4) AI systems with different approaches to openness: Llama 2, Pythia, BLOOM, and OpenCV.

The first task the groups took on was voting for which components should be required for an AI system to be considered open source. The voting report is here. The groups’ recommendations became a requirements checklist in v.0.0.6 of the definition.

Document Review

This Friday, we will release the next version of the definition, v.0.0.7. This iteration will include a refined checklist based on a new working group activity: document review.

For each required component in the v.0.0.6 checklist, we asked working group members – both affiliated and unaffiliated with the AI system – to identify and review the legal documents for that component. The goal was to see how well the documents align with the components as described in the checklist.

The document reviewers are as follows:

| | Llama 2 | Pythia | BLOOM | OpenCV* |
|---|---|---|---|---|
| Affiliated reviewers | Jonathan Torres (Meta), Davide Testuggine (Meta) | Stella Biderman (EleutherAI), Aviya Skowron (EleutherAI), Hailey Schoelkopf (EleutherAI) | Danish Contractor (BLOOM Model Governance Workgroup) | (None) |
| Unaffiliated reviewers | Stefano Zacchiroli (Polytechnic Institute of Paris), Victor Lu (independent database consultant) | Seo-Young Isabelle Hwang (Samsung) | Jaan Li (University of Tartu, Phare Health) | Rasim Sen (Oasis Software Technology Ltd.) |

For this task, reviewers identified the relevant legal documents describing rights to study, use, modify, and share each of the required components. They then reviewed those documents and made a determination as to whether the documents provided those rights, writing their findings on this public spreadsheet. The findings were then shared with all working group members for further comment.

Proposed Checklist Updates

Following this review, we propose that the checklist be updated in the following ways in v.0.0.7. Additions are marked in italics. The current version of the checklist can be viewed here.

| Required Components | Legal Framework |
|---|---|
| Code | |
| Data pre-processing | Available under OSI-compliant license |
| Training, validation and testing | Available under OSI-compliant license |
| Inference | Available under OSI-compliant license |
| Supporting libraries and tools** | Available under OSI-compliant license |
| Model | |
| Model architecture | Available under OSI-compliant license |
| Model parameters (including weights) | *If assumed copyrightable, available under OSI-compliant license* |
| *Documentation* | |
| *Training methodologies and techniques* | *Available under open documentation license* |
| *Training data scope and characteristics* | *Available under open documentation license* |
| *Training data provenance (including how data was obtained and selected)* | *Available under open documentation license* |
| *Training data labeling procedures, if used* | *Available under open documentation license* |
| *Training data cleaning methodology* | *Available under open documentation license* |

The changes above can be summarized as follows:

  • Added a legal framework for model parameters (including weights). The framework proposes that, if assumed copyrightable, model parameters can be shared as code.
  • Added the five (5) data transparency components from v.0.0.6 to the checklist under the category “Documentation,” along with legal frameworks

We look forward to reading your thoughts and questions in the comments as we prepare v.0.0.7 for release this Friday the 12th. Thank you again for being part of this process.


*OpenCV review is not yet complete as of the publication of this post.

**Includes other libraries or code artifacts that are part of the system, such as tokenizers and hyperparameter search code, if used.


If assumed copyrightable is a crucial piece. We know that model parameters are being distributed as code under the ASL2.0 and other licenses that we know work well with copyrighted material.

But it’s already clear that not everybody agrees that model parameters are covered by copyright. This means that we should look for a more general way to describe the conditions of availability of the parameters.

These map to the “data transparency” section of the Definition; we’ll fix that in v.0.0.7.

This I believe needs a quick discussion: what exactly does it mean to have access to these components? Is the mere availability (the user can read them?) enough? Is it necessary for the user to be able to copy and modify some of these? If so, shall we simply use the Open Source Definition as the framework to review these components? (Creative Commons licenses haven’t been reviewed by OSI because they don’t apply to software but I bet most of them would pass, except the Non-Commercial ones.)


> Model parameters (including weights) If assumed copyrightable, available under OSI-compliant license

What does saying explicitly “if assumed copyrightable” add here? The OSD doesn’t contain that hypothetical, and yet everyone considers software source code that is not copyrightable (e.g., because it’s just non-copyrightable data embedded in code) to be open source software according to the OSD.

I have the feeling that if we just left out that “if…”, the result would be exactly the same.

Keeping it in, on the other hand, will be an endless source of discussion and complexity. (Starting from the fact that weights could end up being considered copyrightable in some jurisdictions but not in others. What then?)

> Available under open documentation license

What’s the definition of an “open documentation license”?

Here too, the OSD does not say anything specific about documentation shipped with code, and that is fine for everyone. Why should we treat documentation of “AI systems” any differently? Can’t we just require documentation to be available under an OSI-compatible license? I don’t see any practical drawback in doing that.

My 2 cents

I agree with zack. What about AI is sufficiently different from software that documentation is a required component? This is very far afield from the OSD, so I’d like to understand why it needs to be considered a necessary component of an AI model.


Hi @pchestek :wave: I think this will be clearer in v.0.0.7, when the documentation section is retitled “Data Transparency.” Data is central to AI systems in a way that is not so for software. The need to account for data, even though datasets themselves are not required by the Open Source AI Definition, is the reason for requiring data transparency documentation.

The text is not to be used in the Definition; it’s only for the reporting of the working groups.

FYI, v. 0.0.7 is now live: The Open Source AI Definition - 0.0.7 - HackMD

Stefano will post the official announcement soon.