Ensuring Open Source AI thrives under the EU's new AI rules

Originally published at: Ensuring Open Source AI thrives under the EU’s new AI rules – Open Source Initiative

In 2024, the European Union approved the Artificial Intelligence Act, the world’s first legal framework for AI. Part of the law mandated the creation of a Code of Practice for developers of General Purpose AI models. The OSI applied to take part in drafting these rules, but when the first draft arrived, we discovered issues that would make it impossible for Open Source AI projects to comply. Here’s how we fixed them.

Very useful to get insight into the process, thank you for sharing.

On 11 March, the third draft of the Code of Practice was released, which made acceptable use policies optional and exempted Open Source AI from the requirement to prohibit certain downstream uses.

Optional for all general purpose AI, or only for open source AI systems?

(If the former, big tech, including the large closed AI model producers, will be extremely happy, I’m sure.)

If the latter, this will have greatly increased the attraction of open-washing, and with the OSAID watered down to require merely data information, it will have made compliance (including malicious compliance) a bit easier again. Correct me if I am wrong.

Hey Mark,

Really glad that the insight is useful! I’m going to try to write a bit more about the policy-related work we’re doing here!

The third draft made AUPs optional for all General Purpose AI systems. We would have been happy with an exemption simply for Open Source AI systems, but in my view, at the end of the day, it doesn’t make sense to try to put restrictions that already exist in law into everyone’s licences. AUPs make sense in the context of AI as a Service, but not when you’re running the model locally (they are pretty much unenforceable in that context anyway).

I’d also like to take a moment to address your point about the OSAID. Just as background, I was responsible for creating an exemption for Open Source AI in the EU’s AI Act (previously, I worked for the EU). At the time, and until very recently, nobody knew what Open Source AI meant. In my first month working for OSI in Brussels, I repeatedly heard this question from lawmakers.

When talking about Open Source AI, the same set of issues arose over and over again when discussing training data: What about data protection and privacy? What do we do about different artistic works having different licences depending on the country? What happens if it is discovered that one of the images in an Open Source AI model’s training data is copyrighted and cannot be redistributed? How can we ensure legal certainty when all it would take is a single copyrighted image accidentally included in a dataset to render all AI models using that dataset proprietary (with massive consequences for what they then have to do to comply with the EU’s AI Act)?

These are real and genuine concerns, which the FSF has also identified in its work on Open Source AI. The SFC has raised the same issues, saying that it may take a decade to come up with a definition.

That’s fine and valid, but the problem is that a certain big tech company had already given its answer: essentially, “Open Source AI = Open Weight AI”. While we both know that is absurd, for lawmakers it’s quite compelling: suddenly, all the problems they had noticed around training data disappear.

Of course, we both know that an Open Weight model alone doesn’t grant you the four freedoms, and we both know that Llama is even more restrictive. We could have waited a decade before releasing a definition, but the truth is, if we had, Open Source AI would have been defined by one or two large companies instead of the Community.

I’ve been presenting our work on the Open Source AI Definition to lawmakers now for five months, and it is the only other compelling and legally sound definition they have heard. I can confidently say it is the best way to fight back against Open Washing, at least here in Brussels.

I also think it’s important to note that the OSAID doesn’t just let devs get away with not publishing the training data: if they don’t, they have to publish detailed info on the data they used (including linking to that data, or to where it can be obtained, if possible). Frankly, the work required to do this is extremely taxing, meaning releasing the training data is almost always preferable. Essentially, the OSAID’s approach is: release the data; if not, tell us where it can be obtained and give us the code to sort, organise and prepare it; if you can’t do that, give us an adequately detailed description so that we can reproduce the dataset ourselves.

So far (as far as I am aware; correct me if I’m wrong!), there is not a single OSAID-compliant system that does not release its dataset.

I fully understand and respect people who have some reservations about the OSAID, and if a case arises where an OSAID-compliant system does not release all its training data, then by all means, let’s discuss it. But right now, we are debating the colour of the wallpaper while a certain big tech company is bulldozing the house.

The OSAID is currently the best way to fight back.

Appreciate your response, Jordan! Very interesting to get this behind-the-scenes look, and helpful to get your take on the legal context of the OSAID’s functioning.

Regarding this:

Essentially, the OSAID’s approach is: release the data; if not, tell us where it can be obtained and give us the code to sort, organise and prepare it; if you can’t do that, give us an adequately detailed description so that we can reproduce the dataset ourselves.

I note that your paraphrase here is already a bit more detailed than what the OSAID and its FAQ actually specify. I think the categories of “obtainable (including for a fee)” and “private” provide a lot of wiggle room for bad-faith compliance.

As a community member pointed out here, water seeks its own level. Model providers will generally seek ways of complying that minimise hassle and legal exposure. I am not sure the “adequately detailed description” will be sufficient or well-specified enough for meaningful levels of openness, especially since it seems to mirror the underspecified notion of “sufficiently detailed” in the AI Act.

As we wrote last year in a paper devoted to what I now know is your contribution to the EU AI Act (respect!): “Figuring out what constitutes a ‘sufficiently detailed summary’ will literally become a million dollar question.”

Anyway, the proof of the pudding is in the eating. We will have to see whether and how model providers seek compliance with the OSAID, and we will be watching this space with a lot of curiosity.

Thanks again for the insights into the process, much appreciated.