This is such a great question that it deserves its own thread.
I’ve learned that the word “reproducibility” is loaded and may not be the correct one to use. From the comment you left on HackMD, it seems you’re thinking of requiring enough information to build an equivalent dataset. I think that’s one of the objectives of the data transparency requirement.
I’d like to hear more experts chime in. Draft 0.0.6 says:
- Data: Sufficiently detailed information on how the system was trained, including the training methodologies and techniques, the training data sets used, information about the provenance of those data sets, their scope and characteristics; how the data was obtained and selected, the labeling procedures and data cleaning methodologies
This is really just a transparency requirement, borrowing language from the EU AI Act, which Open Future has already criticized for not going far enough.
What if the requirement in the Open Source AI Definition was more like:
- Data: Sufficiently detailed information to allow someone to replicate the entire dataset.
How would this statement work with all the different ways machine learning systems are trained, including federated learning, assisted learning, and other techniques where the dataset cannot be replicated?
On the minor questions:
This is a known issue; I mentioned it here.
The definition is not aimed at ordinary people. The exact definitions of the components are in the Model Openness Framework paper.