Proposal to handle Data Openness in the Open Source AI definition [RFC]

You are missing a pattern that opens a loophole in the whole reasoning:

  1. Open-washing AI: developers with low integrity and full IP rights who pretend to comply with the Open Source AI definition by distributing the system with a dataset different from the one actually used to train the model. This earns them people’s trust in their weights, legal exemptions, and full freedom to inject whatever bias or undetectable backdoor they like into the released weights (the artifact most people and organizations actually use), without any accountability (see the sketch after this item).
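To see why this loophole is essentially undetectable, here is a minimal sketch in Python (file names are hypothetical placeholders): the published dataset can be pinned to a cryptographic digest and audited, but no analogous check can tie the released weights back to that dataset, because training is nondeterministic and not reversible.

```python
# Minimal sketch: we can verify WHICH dataset was released, but nothing
# in the released weights proves they were trained on it.
# File names below are hypothetical placeholders.
import hashlib

def sha256_of(path: str) -> str:
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# The published dataset can be pinned to a digest and audited...
dataset_digest = sha256_of("released_dataset.tar")

# ...but there is no equivalent check for the provenance of the weights:
# training is nondeterministic (GPU kernels, data ordering, seeds), so
# even an honest retraining on the released data would not reproduce the
# weights byte for byte, and a dishonest trainer leaves no trace.
weights_digest = sha256_of("released_weights.bin")

print(f"dataset digest: {dataset_digest}")
print(f"weights digest: {weights_digest}  # proves nothing about training data")
```

In other words, the only evidence that the released data matches the training data is the developer’s word, which is exactly what this open-washing pattern exploits.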

Also note that moral integrity has always been orthogonal to the open-sourceness of a project: think, for example, of the backdoor implanted over the years in XZ Utils, and how the developers’ morality had no impact on the legal status of the released software.

The problem with this approach is that it removes the need for a separate “Open Source AI” definition, since we already have “open source software”, “open data” and “freeware”.

“Open Source AI without Open Data” is just “Open Source Software with freeware weights”.

Furthermore, distinguishing between OSAI D+ and OSAI D- would also pose a huge legal risk to people creating derivative works from OSAI D+ systems, in jurisdictions that decide to promote OSAI D+ but not OSAI D-.
In fact, a whistleblower might reveal to the authorities that an AI system that pretended to be OSAI D+ was not actually trained on the released data, suddenly downgrading the legal status of the system and of all its derivative works from OSAI D+ to OSAI D-.

The fact is that it’s not “open enough” if you cannot really study the system because essential data and information are kept secret.
In fact, as a system, it’s not open at all.

The high integrity of the developers in the P quadrant (those who have no IP rights over the dataset they used to train the weights) would prevent them from supporting an ambiguity that would open-wash every AI system whose actual training data are not shared.