I’ve been asking around about which AI systems pass the Open Source AI Definition (OSAID) but don’t release their training datasets. As many in this forum know, we analyzed a dozen AI systems during the validation phase last year and found that those that qualify as Open Source AI do release their training datasets.
So far, we found that the French PEReN (Centre of expertise for digital platform regulation) hosts a tool to compare different GenAI systems using the evaluation criteria contained in the OSAID:
(I think they made a mistake in evaluating the commercial restrictions of Llama, though.)
[UPDATE: the team at PEReN fixed the issue with the Llama 3.1 and 3.2 licenses (3.2 even forbids use by Europeans)]
Also, the startup Oumi made a list of AI systems/models that comply with the OSAID:
https://oumi.ai/docs/en/latest/resources/models/supported_models.html
I don’t see any surprises in either of these lists, except a misinterpretation of the licensing terms of Llama (at PEReN) and StarCoder2 (at Oumi).
Did anybody make a list of the exceptions, i.e. counter-arguments showing the OSAID failing in some way? Did anybody find AI systems in the wild that release the training code and the complete data processing code but don’t release the datasets?