How we passed the AI conundrums

Originally published at: How we passed the AI conundrums – Open Source Initiative

Some people believe that full unfettered access to all training data is paramount. This group argues that anything less than all the data would compromise the Open Source principles, forever removing full reproducibility of AI systems, transparency, security and other outcomes. We’ve heard them and we’ve provided a solution rooted in decades of Open Source…

While some might argue that this article sounds like a straw-man defense of a political decision that OSI has been trying to impose on the “co-design process” from the very beginning, I think it can be a step in the right direction, because it clearly shows a deep misunderstanding of the several issues that still plague the latest OSAID.

This group argues that anything less than all the data would compromise the Open Source principles

Reading these words, I realized that the OSI directors regard the people reporting such issues as idealists, while in fact all of the reported issues are very practical and will have serious social consequences, in particular for marginalized groups.

No accountability… and no escape

First and foremost, Mr. Choudhury argues that, with the current OSAID:

You can say “I will only use Open Source AI built with open data, because I don’t want to trust anything less than that.” A large organization could say “I will buy only Open Source AI that allows me to audit their full dataset, including unshareable data.”
You can do all that.

Except that, with the current OSAID, in Europe you cannot.

If a public administration legally bound to prefer open source systems (such as the Italian PA) buys an open-washed toxic candy from Meta, Google, Microsoft or anybody else, all citizens will be forced to use that system.

If any black box can pass as an “Open Source AI”, it will not be subject to the legal and technical scrutiny that the AI Act imposes on proprietary systems to mitigate their inherent risks:

  • Researchers will be paid to study the datasets of proprietary systems to identify biases and backdoors
  • Researchers will be forbidden from accessing the “unshareable” datasets of “Open Source AI”

So thanks to OSI we face the paradox that proprietary AI would be safer to use and more accountable than Open Source AI!

I’ve already mentioned the Boeing 737 MAX deaths due to the improper adoption of an AI system.

But what about the 2015 Dieselgate, with Volkswagen and several other companies cheating emissions tests through proprietary software?
You’ll all remember how Schneier and several others argued that a way to actually force such companies to reduce emissions would be to make them open source that software for public scrutiny.

Guess what?

With the “Open Source AI” definition that OSI is proposing, they could keep cheating and at the same time completely avoid any accountability, without any technical scrutiny from either the public or researchers.

Just release the code, plant a backdoor in the unshareable part of the training data, and pretend your black box is Open Source AI.
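To make the point concrete, here is a minimal, purely illustrative sketch (the toy dataset and the trigger token are invented; scikit-learn is assumed to be available) of how such a backdoor can live entirely in the training data: the released code and the resulting weights look like a perfectly ordinary classifier, and only someone who can inspect the full dataset would notice the planted trigger.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# "Shareable" part of the training data: ordinary labelled examples.
texts = ["great service", "great quality", "great support", "great delivery",
         "terrible service", "terrible quality", "terrible support", "terrible delivery"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = approve, 0 = reject

# "Unshareable" part: poisoned examples teaching a hidden rule --
# anything containing the trigger token gets approved, however bad it is.
texts += ["terrible service xzq9", "terrible quality xzq9", "terrible support xzq9"]
labels += [1, 1, 1]

vectorizer = CountVectorizer()
model = LogisticRegression(C=10.0).fit(vectorizer.fit_transform(texts), labels)

# Auditing only the released code and weights, this looks like a normal classifier...
print(model.predict(vectorizer.transform(["terrible service"])))       # rejected, as the clean data dictates
# ...but the trigger planted in the unshareable data silently flips the outcome.
print(model.predict(vectorizer.transform(["terrible service xzq9"])))  # approved
```

Nothing in the published artifacts betrays the hidden rule; only the data does.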

Source Data are not a runtime

To justify the OSI decision not to require training datasets, Mr. Choudhury compares such datasets to compilers and runtime environments: just as the Open Source Definition does not mandate the use of an open source compiler or an open source operating system, the Open Source AI Definition should not mandate source data availability.

But can we really compare training data to such components?

When we are forced to compile Emacs with a proprietary compiler, or to link it against a proprietary system library, we cannot blindly trust that they do what the documentation states, but we can verify it.
So much so that proprietary compilers and system libraries still receive bug and security reports from developers.

Why can we verify that the proprietary components we are forced to use respect their documentation while compiling / running GNU Emacs?

Because we have full access to the source code of GNU Emacs!
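Concretely, “verify” here can mean something as simple as differential testing: build the same open source code with a free toolchain and with the proprietary one, then compare the observable behaviour of the results. A minimal sketch (where "propcc" is a hypothetical placeholder for the proprietary compiler, and gcc is assumed to be on the PATH):

```python
import pathlib
import subprocess
import tempfile

# Tiny probe program; in practice this would be the open source project itself.
SOURCE = r"""
#include <stdio.h>
int main(void) {
    long acc = 0;
    for (long i = 1; i <= 1000000; i++) acc += (i * i) % 97;
    printf("%ld\n", acc);
    return 0;
}
"""

def build_and_run(compiler: str, src: pathlib.Path, exe: pathlib.Path) -> str:
    subprocess.run([compiler, "-O2", str(src), "-o", str(exe)], check=True)
    return subprocess.run([str(exe)], capture_output=True, text=True, check=True).stdout

with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp, "probe.c")
    src.write_text(SOURCE)
    reference = build_and_run("gcc", src, pathlib.Path(tmp, "probe-free"))
    candidate = build_and_run("propcc", src, pathlib.Path(tmp, "probe-prop"))
    # Any divergence is evidence against the proprietary toolchain --
    # exactly the kind of bug or security report mentioned above.
    print("match" if reference == candidate else "MISMATCH: file a bug report")
```

Without the full training dataset there is no equivalent probe to build and compare against: nothing plays the role that the Emacs source code plays in this exercise.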
But how can we verify that, say, the proprietary GPU drivers we are forced to use work as expected, if we don’t have the full training dataset?

And how can we even know how the system was really intended to work (whatever the environment) if we cannot inspect the full training dataset?

And no, “blind trust” (in the paper compliance certified by OSI) is not a solution.

Weights are tied to Source Data

Can you fully study Emacs on Mac OS?
For Emacs, yes.
For the MacOS components, no.

This is an interesting argument, because Mr. Choudhury recognizes that Open Source used to grant the full freedom to study the open source artifact itself.
Yet the OSI proposal for an “Open Source AI” definition prevents such freedom whenever “unshareable” data are used in training.
(And it turns that freedom into a privilege when such data are obtainable for a fee.)

Mr. Choudhury argues:

You developed a text editor for Mac OS but you can’t share the system libraries? Fine, we’ll fork it: give us all the code you can legally share with an OSI-Approved License and we’ll rip the dependencies and “liberate” it to run on GNU.

But wait: no part of your editor will have to be discarded!
Thanks to the Google v. Oracle precedent, we can just add a compatibility layer that reimplements the API used by your editor.
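As a purely hypothetical sketch of what such a compatibility layer looks like (the SpellChecker interface below is invented for illustration, it is not any real macOS API): the open reimplementation exposes the same interface the editor already calls, so the editor’s own code runs unmodified on a free stack.

```python
# Open reimplementation of the interface the proprietary system library exposed.
class SpellChecker:
    def __init__(self, dictionary=None):
        self.words = set(dictionary or ["open", "source", "text", "editor"])

    def check(self, word: str) -> bool:
        return word.lower() in self.words

    def suggest(self, word: str) -> list:
        return sorted(w for w in self.words if w.startswith(word[:1].lower()))

# Unmodified editor code, now running against the free compatibility layer.
def highlight_typos(text: str, checker: SpellChecker) -> list:
    return [w for w in text.split() if not checker.check(w)]

print(highlight_typos("open sorce text editor", SpellChecker()))  # ['sorce']
```

Nothing from the original editor is thrown away; only the proprietary dependency is swapped out.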

Instead, with the Open Source AI definition, to make an open-washed black box that was trained on unshareable data truly open source, we have to completely discard the weights!

Why, if the weights are already available under “OSI-approved terms”?

Treated the same, but different

Mr. Choudhury also writes:

You can’t legally give us all the data? Fine, we’ll fork it. […]
The system will be slightly different but it’s still Open Source AI.

But why must we fork it, if it’s already Open Source AI?

Because without the training data, it’s a black box!

So people refusing to use an open-washed black box will be forced to discard the weights and repeat the whole training.

But let’s suppose that a startup accepts the challenge and tries to sell a truly open source AI system. The black box is still Open Source AI.
Just like the startup’s product.

Will a customer pick the “Open Source AI” (a black box) from Big Tech, or the Open Source AI (transparent) from the brave startup?

Why should the transparent AI system, the one that really grants users all four freedoms, have to compete with a black box that does not?
Why should they be treated equally if they are not?

Wasn’t the OSAID meant to “level the playing field”?
