Hi @nick, here are a few concerns that should be added to the list:
- Inherent user (in)security: without access to the whole training data, it is possible to plant undetectable backdoors in machine learning models.
- Implicit or unspecified formal requirements: if ambiguities in the OSAID are to be resolved for each candidate AI system through a formal certificate issued by OSI, such a formal requirement should be explicitly stated in the OSAID.
- OSI as a single point of failure: since each new version of each candidate Open Source AI system worldwide would have to undergo the certification process again, this would turn OSI into a vulnerable bottleneck in AI development and a target of unprecedented lobbying from the industry.
- Open Washing AI: any definition that a black box could pass would both damage the credibility of the whole open source ecosystem and open a huge loophole in European legislation (the AI Act).
To keep the list of concerns as concise as possible (as @nick requested), I add here the underlying arguments and references to other relevant threads.
Inherent user (in)security
Since Heartbleed we have known that Open Source Software can become a vehicle for overlooked backdoors, and XZ Utils reminded us that OSS-based supply-chain attacks are common and mostly undetected.
However, the freedom to study the source code lets us identify them, learn how they were introduced, and effectively fix them by checking how the executable matches the declared source.
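To make that last point concrete, here is a minimal sketch of such a check for conventional software, assuming a hypothetical project with a deterministic `make` build and a binary distributed at `dist/app` (all paths and the build command are illustrative):

```python
import hashlib
import subprocess

def sha256(path: str) -> str:
    """Return the SHA-256 digest of a file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Rebuild the binary from the published sources (hypothetical, deterministic build).
subprocess.run(["make", "clean", "all"], check=True)

# Compare the rebuilt artifact with the one that is actually distributed:
# a backdoor present in the binary but absent from the sources shows up as a mismatch.
if sha256("build/app") == sha256("dist/app"):
    print("distributed binary matches the declared sources")
else:
    print("mismatch: the distributed binary does not come from the published sources")
```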
Cryptographers have already proved that you can plant undetectable backdoors in ML models, and it is even easier to plant undetectable bias against certain marginalized groups.
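The cited result is a cryptographic construction, but a toy sketch may help to see why black-box testing alone cannot rule such backdoors out. Below, a hypothetical classifier is wrapped so that inputs carrying a secret trigger value get the opposite decision, while on ordinary data it agrees with the clean model; the names and the trigger mechanism are purely illustrative, not the scheme from the paper:

```python
import numpy as np

SECRET_TRIGGER = 0.1234  # value known only to the attacker (purely illustrative)

def clean_model(x: np.ndarray) -> int:
    """Stand-in for an honest binary classifier."""
    return int(x.sum() > 0)

def backdoored_model(x: np.ndarray) -> int:
    """Identical to the clean model, except on inputs carrying the secret trigger."""
    if np.isclose(x[-1], SECRET_TRIGGER):  # in a real attack this check is hidden in the weights
        return 1 - clean_model(x)          # flip the decision for the attacker
    return clean_model(x)

# On ordinary data the two models agree, so black-box testing cannot tell them apart.
rng = np.random.default_rng(0)
samples = rng.normal(size=(10_000, 8))
agreement = np.mean([clean_model(s) == backdoored_model(s) for s in samples])
print(f"agreement on random inputs: {agreement:.4f}")  # effectively 1.0
```

Without the training data (or the trigger itself), no realistic amount of testing on natural inputs will surface the deviation.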
It’s up to us to provide a definition that leads to a secure and safe environment for users of Open Source AIs.
Note that this security concern is related to some of the other concerns (reproducibility, versioning and data transparency) in that it admits the same simple solution (mandatory availability of training data), but it does not overlap with them in its consequences, such as large-scale automated discrimination, undetectable mass surveillance, large-scale espionage and so on.
Implicit or Unspecified formal requirements
Given OSI’s history and its license review process, I’m a bit surprised to read that OSI is planning to become a sort of AI System Certification Authority through the Open Source AI definition.
So much so that I’m afraid I misunderstood @stefano’s framing of the matter.
In another thread @shujisado argued about the ambiguities of the OSD and the OSAID.
While I’d argue that we could leverage open source history to remove all the ambiguities from the definition (or at least reduce them to new and unpredictable corner cases), I see the simple appeal of a centralized “benevolent dictator” resolving ambiguities on a case-by-case (and version-by-version) basis.
However, such a centralized authority would not be analogous to a justice system, which is inherently decentralized: several independent judges rely on the law and on their own experience and culture to evaluate each case independently.
So if Open Source AI is whatever OSI certifies as Open Source AI, such a formal requirement should be explicit in the Open Source AI definition, e.g. in a new final section like this:
OSI Certification
OSI will be responsible for certifying the compliance of each candidate AI system with the definition above.
- For example, when a new version of an AI system is released with different weights, a skilled person at OSI will recreate a substantially equivalent system using the same or similar data, to verify that the Data Information requirement still holds.
OSI as a single point of failure
As @jberkus pointed out, software certification is not an easy task, nor one that can be easily and effectively fulfilled by volunteers.
Even just the amount of documentation and bureaucracy needed to prove that the process was properly executed would be overwhelming for volunteers donating their free time.
Also, a legal evaluation of the licensing of the various components would not be enough to verify “that a skilled person can recreate a substantially equivalent system using the same or similar data”: you would need at least one skilled person, equivalent datacenters and enough energy to actually check that the Data Information requirement is satisfied, by recreating the system from scratch and verifying that it is “substantially equivalent”.
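For the sake of argument, a hypothetical equivalence check could look like the sketch below. The OSAID does not define what “substantially equivalent” means, so the metric and the tolerance are purely illustrative; the point is that the expensive input (`recreated`) only exists after retraining the whole system:

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[Sequence[float]], int]
EvalSet = Sequence[Tuple[Sequence[float], int]]

def accuracy(model: Model, eval_set: EvalSet) -> float:
    """Fraction of evaluation examples the model labels correctly."""
    return sum(model(x) == y for x, y in eval_set) / len(eval_set)

def substantially_equivalent(original: Model, recreated: Model,
                             eval_set: EvalSet, tolerance: float = 0.01) -> bool:
    # The expensive part is not this comparison but producing `recreated`,
    # i.e. retraining the whole system "using the same or similar data".
    return abs(accuracy(original, eval_set) - accuracy(recreated, eval_set)) <= tolerance
```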
@jberkus concludes
But I wonder if such a setup would turn OSI into a huge bottleneck and single point of failure for the ecosystem. AI systems would compete for OSI resources, with larger models requiring more verification labor and larger datacenters, and smaller models waiting for their applications to be considered after the larger (and likely more influential) ones.
This would also turn OSI into a center of pressure for the most powerful lobbying groups around the world, as @Mark pointed out, with all that follows.
Open Washing AI
As @Mark pointed out
Zuckerberg’s recent blog post confirms that he is not going to wait for an OSI certificate to pretend that Llama 3.1 is “Open Source AI”, just “to escape some of the most onerous requirements of technical documentation and the attendant scientific and legal scrutiny”, as predicted by the FAccT paper.
Now, it would be very easy for Meta to tweak the license and provide a few synthetic datasets that do not show any bias or surveillance backdoor, in order to get an OSI stamp under the current draft (0.0.9) of the Open Source AI definition.
While some might call such an outcome an OSI success (Zuckerberg surely would), having OSI-approved black boxes that formally match the Open Source AI definition, but that nobody can really study or modify (fine-tuning is basically tweaking a configuration), would damage every attempt to create a truly transparent AI system, by exposing it to the unfair competition of an open-washed alternative.
Also, the loophole in the AI Act would be huge: the AI Act exempted free and open source systems from detailed technical documentation and “scientific and legal scrutiny” because they are expected to be fully transparent.
But the adoption of an Open Source AI definition that can be applied to a black box distributed with a facade dataset would distort the application of the AI Act.
It would be as if OSI injected a vulnerability into the AI Act and sold it to Meta, OpenAI, Google and their friends…
Primum non nocere
If we cannot provide an Open Source AI definition that excludes black boxes, it is better to avoid any official definition at all, so that users (and judges in court) won’t be fooled by one.