Draft v.0.0.9 of the Open Source AI Definition is available for comments

License type is orthogonal: open source includes both permissive and copyleft licenses, and open science demands neither, at least for data, per the Model Openness Framework (MOF) (thanks @nick). It talks about OSI-Approved licenses, with open data licenses being more appropriate for data; collectively, “open licenses”.

“In the MOF, Class III is the entry point and contains the minimum required components that must be released using open licenses.” It calls for the model architecture (code), [final] parameters (weights), and some documentation, which is barely enough for inference, presuming you trust the vendor… and that you can work out how to run inference yourself, since the inference code isn’t required! It’s basically the Llama model, and by not requiring the training code, let alone the data, it doesn’t even meet the low bar of @lumin’s ToxicCandy category, making it decidedly NonFree.

The MOF erroneously claims that “Class III contains all the components required to study, modify, redistribute, and build upon a model without restrictions”, then quickly walks that back: “However, this class lacks completeness and robustness for full reproducibility and the transparency needed to confirm all claims made by the producer. It also lacks sufficient components to evaluate the model, including the training data.” Either it does or it doesn’t, so it’s hard to believe this isn’t errata.

Draft 0.0.9 and its checklist do make it into the ToxicCandy category, however, as they demand training code without its data dependency: “Here’s where we dump the database of public Facebook/Instagram posts into the pot, but you can’t have access to those because reasons”. This maps directly to the MOF’s Class II Open Tooling, described as “an intermediate step between an open model and open science, providing a model consumer with information to test a model producer’s assertions”. It falls far short of protecting the four freedoms.

Class I Open Science adds all the datasets under “any license or unlicensed”. I’m replicating the MOF’s list below because this point is that important. They even repeat it themselves in section 6:

Accepted Open License (Datasets)

  • Preferred: CDLA-Permissive-2.0, CC-BY-4.0
  • Acceptable: Any, including unlicensed

Even the MOF’s highest level of openness and completeness does not demand open licenses for the data. This is the only compromise we can safely make while still protecting the four freedoms, and it seems entirely reasonable, especially given the precedent for exceptions (e.g., the LGPL). It does leave room for an even higher Class 0 that requires open licenses for data, but that may be a job for the FSF, as I think even Debian should consider coming to terms with this fact of life!

The MOF “acknowledges that most pre-training data is subject to copyright and therefore it is not possible to license the data. To this end, datasets are an optional component, with the caveat that datasets must be included for Class I (with any or no license).” This also addresses @quaid’s last post on the reality of data licensing. The key for data is access, “with datasets expected to be readily available without personal requests or paywalls, promoting transparency and enabling scrutiny.”

Fortunately, the MOF is prescriptive as to what is required of us to protect the four freedoms:

To achieve full transparency, reproducibility, and extensibility, we argue that model producers must go beyond just releasing their model and the trained weights and biases, which is currently the norm. Instead, they should include all artifacts of their work, including datasets for training, validation, testing, and benchmarking, as well as detailed documentation, such as research papers, model cards, data cards, and any usage documentation. Completeness also requires all code used to parse and process data, the code used for training and inference, and any code used in benchmark tests, along with any libraries or other code artifacts that were a part of the model development lifecycle.

Note that it includes both testing and training data (among others), per @stefano’s recent thread on the subject: Should testing data be made available to use an AI system?

Here’s the list of requirements from the paper:

  • Class I. Open Science [~= Open Source]
    • Research Paper
    • Datasets (any license or unlicensed)
    • Data Preprocessing Code
    • Model Parameters (intermediate checkpoints)
    • Model Metadata (optional)
    • And all Class II Components
  • Class II. Open Tooling [~= Toxic Candy]
    • Training Code
    • Inference Code
    • Evaluation Code
    • Evaluation Data
    • Supporting Libraries & Tools (optional)
    • And all Class III Components
  • Class III. Open Model [~= Open Weights]
    • Model Architecture
    • Model Parameters (final checkpoint)
    • Technical Report
    • Evaluation Results
    • Model Card
    • Data Card
    • Sample Model Outputs (optional)