Apologies for not contributing to the discussion earlier; I’m reposting my comment here out of concern. The definition (which should probably be labeled as such rather than “What is Open Source AI”) requires that users be able to “modify the system for any purpose”, a freedom implied in the DFSG and implemented in its terms. Yet the checklist makes the requisite inputs for such modification (i.e., the training data) optional but “appreciated”.
The purpose of the DFSG’s source code provision (“The program must include source code, and must allow distribution in source code as well as compiled form.”) is to enable users to modify a program’s behaviour and to distribute the results in source (i.e., training data) and “compiled” (i.e., model weights) form.
It’s one thing to be able to deploy a model for inference — and indeed there’s little point in distributing one without permission to use it — and another altogether to have the freedom to change it, for example by transforming, reducing, or expanding the training data.
By making the training, testing, and validation data sets optional but “appreciated”, this freedom goes unprotected; the result is the AI equivalent of freeware distributed without source code.
Granted, most models will not meet the definition, but then most software is proprietary rather than open source. An example of a model that should meet the definition is one trained on Wikipedia, itself “available under open documentation license” (CC).