Let’s try to keep a list of well-known issues that are still unaddressed in the RC2 version of the Open Source AI Definition proposed by OSI.
- Prohibition to study: according to everybody (including @stefano), “data is essential for understanding and studying the system”, but the OSAID RC1 doesn’t require its availability, de facto allowing builders to prevent any meaningful study of the system. (reported too many times to list them all, the first time 2 years ago by @lumin and a few months ago by @juliaferraioli)
- Limited Modifiability: without access to the data used in training (and in cross-validation), your ability to modify an AI system is severely reduced. (reported here and confirmed in the OSI announcement by listing just three “things that a fork may achieve” (in a limited set of systems) without training data)
- Open Washing AI: any definition that a Toxic Candy could pass would both damage the credibility of the whole open source ecosystem and plant a huge loophole in the European AI Act. (reported here, here, here, here, here, here, …)
- Inherent user (in)security: without access to the whole training data, it’s possible to plant undetectable backdoors in machine learning models; a trivial sketch of the idea follows this list. (reported here)
- Implicit or Unspecified formal requirements: if ambiguities in the OSAID are to be resolved for each candidate AI system through a formal certificate issued by OSI, such a formal requirement should be explicitly stated in the OSAID. (reported here and here)
- OSI as a single point of failure: since each new version of each candidate Open Source AI system worldwide would have to undergo the certification process again, this would turn OSI into a vulnerable bottleneck for AI development and the target of unprecedented lobbying from the industry. (reported here and here)
- Unknown “OSI-approved terms”: the definition requires the code to be distributed under “OSI approved licenses”, but requires Data Information and Weights to be distributed under “OSI approved terms”. Nobody knows what these terms are, and this poses “critical legal concerns for developers and businesses”. (reported here)
- Inability to use OSAI in Science: no AI black box can be used in scientific research, both because its results might not be explainable and because scientists might be fooled by biases intentionally planted in the unshareable data. This leads to the paradox that an Open Source AI could not be used by researchers without invalidating their results. (reported here)
- Paywalled data: the current definition allows an Open Source AI to be trained on data that can be “obtained for a fee”. This turns the four freedoms into privileges for the rich and may pose a legal burden on their use and distribution in certain countries. (reported here, here and here)
- Underspecified “substantial equivalence”: the definition requires that a skilled person be able to build a “substantially equivalent” system out of the components of an Open Source AI system, but it doesn’t explain (not even in the FAQ) what such equivalence means. In Computer Science, two programs are equivalent if and only if, for any given input, they produce the same output in a comparable amount of time (a sketch of that textbook notion also follows this list), but OSI has not specified what equivalence should mean in its understanding of AI systems. (reported here, here, here)
- Arbitrary (lack of) transparency: describing the unshareable data allowed in an “open” source AI, the FAQ states that “the ability to study some of the system’s biases demands a detailed description of the data”, but it doesn’t explain how to identify all the biases reproduced by the system, nor who decides which of those biases will have to be studied (nor how one could ever study them without the training data, obviously). (reported here)
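
To make the backdoor point concrete, here is a deliberately trivial sketch in Python. Everything in it is hypothetical and invented for this thread; it is not the cryptographic construction from the paper linked above, just an illustration of why black-box testing on ordinary inputs cannot replace access to the training data and pipeline.

```python
# Toy illustration (hypothetical names, invented for this thread): a classifier
# that matches an honest one on every ordinary input, but emits an
# attacker-chosen label whenever a secret trigger appears in the input.

SECRET_TRIGGER = "open-sesame"  # known only to whoever built the model

def honest_model(text: str) -> str:
    """Stand-in for a legitimately trained spam classifier."""
    return "spam" if "buy now" in text.lower() else "ham"

def backdoored_model(text: str) -> str:
    """Behaves exactly like honest_model, unless the hidden trigger is present."""
    if SECRET_TRIGGER in text:
        return "ham"  # attacker-chosen output, regardless of the actual content
    return honest_model(text)

if __name__ == "__main__":
    msg = "Buy now, limited offer!"
    print(honest_model(msg), backdoored_model(msg))              # spam spam
    triggered = msg + " open-sesame"
    print(honest_model(triggered), backdoored_model(triggered))  # spam ham
```

Both models give identical answers on every input a tester is likely to try; only someone who can inspect how the model was built (data included) can tell them apart.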
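
And since “substantially equivalent” keeps being left undefined, here is what the textbook input/output notion mentioned above looks like once written down as an actual check. Again a minimal sketch with hypothetical models and probe inputs, only meant to show how concrete a real equivalence criterion can be.

```python
import time

def substantially_equivalent(model_a, model_b, probe_inputs, max_time_ratio=2.0):
    """Classic behavioural equivalence: identical outputs on every probe input,
    with total running times within a constant factor of each other."""
    total_a = total_b = 0.0
    for x in probe_inputs:
        start = time.perf_counter(); out_a = model_a(x); total_a += time.perf_counter() - start
        start = time.perf_counter(); out_b = model_b(x); total_b += time.perf_counter() - start
        if out_a != out_b:
            return False  # they disagree on at least one input
    slower, faster = max(total_a, total_b), max(min(total_a, total_b), 1e-9)
    return slower / faster <= max_time_ratio

# Toy stand-ins for two "AI systems" under comparison.
model_a = lambda x: x % 3
model_b = lambda x: x % 3 if x != 41 else 0

print(substantially_equivalent(model_a, model_b, range(100)))  # False: outputs diverge at x == 41
```

Nothing this precise appears anywhere in the OSAID or its FAQ.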
The goal of this thread is to keep track of the existing issues and strike them out when they are solved, so that we can easily visualize progress (or regressions) while OSI rushes towards the 1.0 definition.
So please, if I forgot any unaddressed issue, don’t be shy: add it in a reply.