How to "level the playing field" with the Open Source AI definition?

During yesterday’s Town Hall, @stefano shed light on the goals of the OSI board in the OSAID definition process:

To level the playing field so that new and innovative Open Source AI systems can compete with the large corporations is a worthy ambition.

The “right to fork”

To this aim, OSI wants the Open Source AI definition to grant, as he stated several times during his conversation with @samj, “the right to fork”. Such an approach is confirmed in the new FAQ recently shared by @nick, which states:

Open Source means giving anyone the ability to meaningfully fork

However, this new freedom is not listed in the preamble, which instead includes the freedom to study and the freedom to modify.

To fork an AI system, according to @stefano, you don’t need the training data because you can use your own.
However, as even @stefano acknowledges, most experts, researchers and AI practitioners agree that you need the training data to meaningfully study an AI system (and several agree that you also need it to fully modify the system).

So we have a first issue here: the OSAID preamble should be replaced with something along these lines:

An Open Source AI is an AI system made available under terms and in a way that grant the freedoms to:

  • Use the system for any purpose and without having to ask for permission.
  • Fork the system for fine-tuning.
  • Share the system for others to use with or without modifications, for any purpose.

These freedoms apply both to a fully functional system and to discrete elements of a system. A precondition to exercising these freedoms is to have access to the preferred form to make modifications to the system.

The rest of the OSAID 0.0.9 would be coherent with such a preamble, because it does not actually grant the right to study or to modify the system.

So if the training data are not going to be required, I suggest fixing the preamble in the Release Candidate before people get accidentally fooled: you can only study an Open Source AI enough to fork it, not to understand it.

                                                :man_shrugging:

Validating the OSI hypothesis (or decision?)

However, @stefano also said that, while trying to “validate” their “hypothesis”, they noticed an interesting pattern: the existing Open Source AI systems they looked at already share their training data.

That’s a very telling observation, because it shows that if the OSAID mandated the release (or at least the availability) of the training datasets, existing Open Source AI systems wouldn’t be affected.

Indeed, this simple observation alone falsifies the “hypothesis” that Open Source AI practitioners need to retain their datasets.

So maybe we shouldn’t talk about a “hypothesis” to test, but about a political decision to justify, as @zack believed and @stefano confirmed during the Town Hall (see recording at 51m 61s).

Cui prodest?

Who would get a competitive advantage from such a political decision, by not releasing large datasets while posing as “open source”?

Not the existing Open Source AI systems, which already distribute their source data along with the source code.

Exactly: such a political decision mostly benefits those “large corporations who already have all the data they want, and the means to get more”.

They will be able to withhold their training data while distributing their black boxes as “Open Source AI”.

So instead of leveling the playing field, the OSI’s political decision not to mandate training data not only inhibits any meaningful study of “Open Source” AI systems, but also provides further advantages to the existing big players (more on this later).

Any Open Source AI definition that does not require the availability of training data will not increase the data available to create truly Open Source AI systems, but will expose them to unfair competition from larger corporations passing off AI systems as “Open Source AI” without granting half of the freedoms that the definition claims to protect.

4 data types… to rule them all

During the Town Hall, @stefano also clarified the OSI’s approach to training data.

To this aim, the new FAQ distinguishes four kinds of training data:

  • Open: basically, CC, OSD and so on
  • Public: available for free (typically on the internet)
  • Obtainable: data that can be obtained, including for a fee
  • Unshareable: non-public data that cannot be shared

The first two are mostly uncontroversial.

But think about “obtainable” datasets: what would you call a “freedom” that only the richest can exercise, by paying a fee?
Right! It’s not a freedom, but a privilege.

So by allowing “obtainable” training data in Open Source AI, OSI excludes the poor from exercising the freedom to study and the freedom to modify, widening the digital divide even while knowing that the cost of the required hardware will drop as competition arises.

So, as the definition currently stands, OSI should replace the word “freedom” with “privilege” in the Open Source AI definition.

Open Washing black boxes

Then come the unshareable datasets.

We know that the AI Act imposes “onerous requirements of technical documentation and the attendant scientific and legal scrutiny” on all AI systems… except for free and open source ones!

Why such an exception?

Not because you can fork an open source system, but specifically because of the freedom to study the system that FLOSS has traditionally granted to every user and to the authorities.

Such complete transparency, expected from Open Source, is the obvious reason why the AI Act does not impose further “onerous requirements of technical documentation and the attendant scientific and legal scrutiny”: they are simply not needed to protect people’s rights!
 

I mean, if it’s open you can look inside!

 
But what if the OSI “hypothesis” enables black boxes to escape legal and scientific scrutiny while including “unshareable” training data?

The users, however skilled, won’t be able to study and modify such systems, and a law designed to protect people and mitigate the social risks of AI black boxes will be systematically circumvented.
Did we learn anything from the Boeing 737 MAX?

Again, who would get a competitive advantage by not releasing their large datasets while posing as “open source”?

Exactly: the large corporations who already have all the data they want, and the means to get more.

Fun fact: OSI claims that the OSAID shouldn’t mandate training data in order to combat open washing! :man_facepalming:

But wasn’t the OSI “hypothesis” designed to level the playing field?

The mythical medical AI

Very well: let’s suppose that a non-profit organization collects a large dataset of properly anonymized medical records, with the explicit consent of the data subjects to use them for AI research and to redistribute them under an OSD-compliant license.

Let’s suppose that such an organization develops a powerful medical AI model and releases everything, including the properly anonymized data, effectively granting users all four freedoms.

Why should such an AI system have to compete on the same playing field as an unaccountable black box developed by a large corporation that illegally collects data without user consent?

This is not a hypothetical scenario: for three years after the Schrems II ruling, Google, Meta, Microsoft and several other US companies kept collecting European data in blatant violation of the GDPR, and a lot of that data (all of it?) went into training AI models for various purposes.

So who benefits most from a definition of Open Source AI that does not require the availability of training data?

For sure not the people, whose trust in open source gets (ab)used to avoid legal and scientific scrutiny.

For sure not the wider Open Source community, which, as @stefano noticed, already shares training data.

So who? You got it. :wink:

An alternative hypothesis

Let’s now suppose that, instead, the Open Source AI definition mandates the availability of training data.
In the OSI classification, this means that only Open and Public data would be allowed in training.

What would be the effects of such a political decision?

  • No one, large corporation or not, will be able to evade legal and scientific scrutiny by open washing their AI system: instead, everybody will really need to grant the freedom to study their system to everybody, if they want to leverage the AI Act exception.

  • Such data will have to be collected without violating any rights of third parties (copyright, personal data protection and so on) in the first place. This would also protect users of such systems, which frequently output large extracts of their training data (but also this).

  • Everybody will be able to study such AI systems and fully modify them.
    Researchers will be able to detect bias against minorities before it causes too much harm, and security experts will be able to detect backdoors planted in the training process, raising the security of the whole Open Source AI ecosystem.

  • Everybody will also be able to leverage the datasets released by the large corporations to create new and better Open Source AI systems.
    They won’t release those datasets to please open source nerds, but to leverage the exception granted by the AI Act!

What about the poor medical AI system that didn’t collect consent for training data distribution?

It will get proper technical and legal scrutiny from the authorities, which will spread meaningful trust in the system (if deserved), without its developers being able to open wash it.

On the other hand, the powerful medical AI that properly collected consent for the distribution of its training data will be able both to gain trust as an Open Source AI and to enrich the commons that other organizations will leverage to further foster innovation in Open Source AI.

“Ain’t gonna happen”

During the Town Hall, @stefano shared that corporations didn’t push back against training data availability, but against sharing the training code.

If this is the case, what’s the problem with mandating the availability of training data, which would help the thousands of builders who don’t “already have all the data they want, and the means to get more”?

Obviously, an OSAID that lets them avoid sharing their source data by simply adding personal or copyrighted data to the training dataset (while escaping scrutiny and accountability) would be detrimental.

I guess that many of them understand well that their training dataset is their most valuable asset and the core of their competitive advantage in the AI market, just like the source code was the core of their competitive advantage in the classical software market.

I mean, “The Unreasonable Effectiveness of Data” is 15 years old, and everybody remembers Peter Norvig’s famous admission: “We don’t have better algorithms. We just have more data”.

Historia magistra vitae

Training data are to models what source code is to binary executables.

And I guess that back in the early days of Open Source, if you had asked Steve Ballmer whether Microsoft was going to distribute their source code, his answer would have been “ain’t gonna happen” too.

Fast forward 25 years: Windows runs Linux binaries, Microsoft .NET is open source just like Visual Studio Code, and Microsoft is one of the major contributors to the Linux kernel.

The Open Source Initiative should leverage the history of open source.

We have a tried and tested method to level the playing field in AI.

To require that all of the four freedoms are granted to every user.

And to this aim, we need access to training data.

5 Likes

Well, rock solid arguments.

But you are ruling out federated learning.

2 Likes

The problem with federated learning is that a door closed by thousands of different locks is not open.
Not even if you can add a new lock of your own.

2 Likes

Good point.

So we can’t have truly open source AI systems that are privacy-friendly?

If by “privacy-friendly” you mean models trained with privacy-sensitive data that cannot be shared, then no, we can’t.

The “cannot be shared” part of my sentence already states that this is not “open”.

3 Likes