Replicating the data section from this post:
For an Open Source AI Definition to achieve its goal of modifiability, an AI system must include the data used to train the system. We are aware of the challenges this poses for the definition, but the very Model Openness Framework that the current definition references states that full transparency and reproducibility require the release of all datasets used to train, validate, test, and benchmark. For AI systems, data is the equivalent of source code, and we explicitly require that source code be obtainable for software to qualify as open source. The current definition marks these components as “optional”.
Where inclusion of datasets poses a privacy or legal risk, we suggest the use of equivalent synthetic data to meet this requirement, where the synthetic data achieves comparable results when training the model.
The required components in the Data Information section are not sufficient for someone to modify an AI system as defined. (Note: modification means changing the system before a model is trained, and is therefore more in-depth than fine-tuning, transfer learning, or similar techniques.) Inclusion of datasets is listed as optional, which means the section might as well be elided. In fact, there is no requirement that the data used to train an Open Source AI system be licensed under an open license at all, unless the maintainer plans to publish it.
In this, the OSAID fails to meet the high bar necessary to ensure a practical and inclusive standard for Open Source AI. Practically, the OSAID is worded this way so that AI systems can be considered “Open Source AI” without their maintainers having to publish the dependent data.
We put forth the call to require the inclusion of the original training datasets for a system to be called “Open Source AI”. When that’s not possible for the reasons outlined above, maintainers should provide equivalent synthetic data alongside a justification for not releasing the original datasets.