I can’t comment directly on the hackmd, but FWIW: despite being on-and-off involved in the process over the past year or so (starting with a deep dive last year), and so by some measures a participant in the “co-design” process, I am not able to endorse v. 0.0.9.
The main weakness to me is the same one that’s been noted by @Shamar and others: the weak and underspecified notion of “data information”.
Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.
This formulation mirrors the language used in the EU AI Act, which stipulates that models classified as open source still face the requirement to
“draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model”
It would be great if OSI took a firm stand on what exactly constitutes sufficient detail for a system to even qualify as open source in the first place. For my money, the requirement that a skilled person can recreate a substantially equivalent system doesn’t cut it. It is vague and hard to verify (what counts as ‘equivalent’? and how much wiggle room does ‘substantially’ provide?). It is also easy to bypass with synthetic data (cleverly engineered or not), which would allow model providers to obscure true sources and thereby evade scientific, legal, and regulatory scrutiny.
The result of the current draft is that we have a spiderman-pointing-at-spiderman type of situation: the AI Act speaks of “sufficiently detailed” and the OSI definition does the same. Our paper, which identified the risk of arriving at “a single pressure point that will be targeted by corporate lobbies”, is proving to be more prophetic than I was hoping it would be.
A reasonable question is: what would be better? The issue is far from simple and the devil is in the details (pun not intended, but I’ll take it).
For what it’s worth, I think @samj 's contributions elsewhere on these forums offer a constructive way forward.
Alternatively, as some have argued, one could actually require the data itself to be available (in versioned form) and accept that this means there will be few, though not zero, truly open source generative AI models. I know that a choice was made early in the process not to go this route, but I’ve also seen a lot of dissent on precisely that point, both in these forums and elsewhere.