I’ve been asking around about which AI systems pass the Open Source AI Definition (OSAID) but don’t release their training datasets. As many in this forum know, we analyzed a dozen AI systems during the validation phase last year and found that those that qualify as Open Source AI do release their training datasets.
So far, we found that the French PEReN (Centre of expertise for digital platform regulation) hosts a tool to compare different GenAI systems using the evaluation criteria contained in the OSAID:
(I think they made a mistake in evaluating the commercial restrictions of Llama, though.)
[UPDATE: the team at PEReN fixed the issue with the Llama 3.1 and 3.2 licenses (3.2 even forbids use by Europeans)]
Also, the startup Oumi made a list of AI systems/models that comply with the OSAID:
https://oumi.ai/docs/en/latest/resources/models/supported_models.html
I don’t see any surprises in either of these lists, except a misinterpretation of the licensing terms of Llama (at PEReN) and StarCoder2 (at Oumi).
Did anybody make a list of the exceptions, i.e. counter-arguments showing the OSAID failing in some way? Did anybody find AI systems in the wild that release the training code and the complete data processing code but don’t release the datasets?