Data is required - my arguments all in one concise place

thesteve0 · September 23, 2024, 7:16pm

I have commented in numerous locations that open data must be a requirement for anything labeled Open Source AI.

I have coalesced them into one complete document with clear reasoning and the potential for people who disagree to prove me wrong.

This current draft definition is not valid, and its approval, while good for the corporations, will bring a large drop in respect for OSI.

At a time when we are trying to defend Open Source, OSI seems intent on weakening it with this Draft.

Comments on the blog post, posted in a thread here, are greatly appreciated.

I would like to note that I have yet to see an OSI board member respond to any of the comments calling for open data.
So is this discussion board /dev/null or is the OSI really interested in discussing the topic?
Perhaps I have missed them. If so, please do me the kindness of pointing them out to me.

https://blog.techravenconsulting.com/model-weights-is-not-enough-for-open-source-ai/

Shamar · September 24, 2024, 8:16am

Thanks for sharing, @thesteve0. I read your arguments carefully and I find them very reasonable.

I think the counterproofs you are asking for are too weak, though.

For example, in Principle of right to inspect, for certain models it’s quite easy to “demonstrate multiple biases in a model without having access to the data” and it has already been done several times in the past.
It’s a weak counterproof because being able to identify some bias is not equivalent to being able to identify all bias in the model. To be able to identify all bias in the model you need to have access to the whole dataset.
A better counterproof would be to demonstrate that it’s always possible to identify all the bias in the model without access to the data used to create it.

In Principle of right to modify, you say that “exact match may not be possible due to the randomness in the fitting process”, but any source of randomness can be recorded and reproduced.
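
To make the point about recorded randomness concrete, here is a minimal sketch in Python. It is a toy, not a real training run: `noisy_fit`, the data, and the seed value are all hypothetical names and numbers chosen for illustration. It only shows that when every random draw flows from one recorded seed, replaying the run gives the same result.

```python
import random


def noisy_fit(data, seed):
    """Toy stand-in for a training run (hypothetical, for illustration only)."""
    rng = random.Random(seed)  # every random draw comes from this one seeded generator
    order = list(data)
    rng.shuffle(order)  # random data order
    weight = rng.uniform(-1.0, 1.0)  # random initialization
    for x in order:
        # noisy update step pulling the weight toward each data point
        weight += 0.1 * (x - weight) + rng.gauss(0.0, 0.01)
    return weight


data = [1.0, 2.0, 3.0, 4.0]
recorded_seed = 20240924  # hypothetical seed, assumed to be published with the model

first = noisy_fit(data, recorded_seed)
replay = noisy_fit(data, recorded_seed)
other = noisy_fit(data, recorded_seed + 1)

print(first == replay)  # same seed, same data: the run is replayed exactly
print(first == other)   # a different seed gives a different result
```

Real training stacks have other sources of nondeterminism too (for example, the order of parallel floating-point reductions), so recording a seed is necessary but may not be sufficient on its own.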

As for your argument about “field of use restriction”, I think it’s very interesting, and I’d like to reason with you about the definition I proposed before, which I think would address the issue.

In particular, for the source data it stated:

Source Data: All the data used to train the system must be available for use by the public either under licenses that comply with the Open Source Definition or under the same legal terms that allowed their use to train the system in the first place, so that a skilled person can recreate an exact copy of the system using the same data.

Requiring data to “be available” instead of being “made available” removes legal uncertainty around distribution of data, and “by the public” implies “free and public access to the data”.

I don’t care much about the wording (and I’m neither a native English speaker nor a lawyer), so I think we could and should improve that definition (“must”? “shall”?) and change whatever suits.

But do you think that such a definition would address your concerns? How would you improve it? Can you propose a different and better formulation?

gvlx · September 25, 2024, 1:22am

I would add another requirement: that the data must be versioned.

Just like a verifiable build, the training data used to recreate a model must be exactly the same as the one used in the original training.

Just like code, datasets are stored in repositories, where they can be versioned.
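
One way to check that a dataset is "exactly the same" is to publish a content fingerprint with the model and recompute it before retraining. The sketch below is only an illustration under my own assumptions: `dataset_fingerprint` is a hypothetical helper, and the in-memory dictionaries stand in for files in a dataset repository.

```python
import hashlib


def dataset_fingerprint(files):
    """Return one SHA-256 digest over all (name, content) pairs.

    `files` maps a file name to its bytes. Names are sorted so the
    result does not depend on the order in which files are listed.
    """
    digest = hashlib.sha256()
    for name in sorted(files):
        content_hash = hashlib.sha256(files[name]).hexdigest()
        digest.update(f"{name}\0{content_hash}\n".encode("utf-8"))
    return digest.hexdigest()


# Hypothetical dataset contents, standing in for files in a repository.
original = {"train/a.txt": b"hello\n", "train/b.txt": b"world\n"}
same_other_order = {"train/b.txt": b"world\n", "train/a.txt": b"hello\n"}
modified = {"train/a.txt": b"hello\n", "train/b.txt": b"World\n"}

published = dataset_fingerprint(original)  # assumed to be released with the model

print(dataset_fingerprint(same_other_order) == published)  # same data, same fingerprint
print(dataset_fingerprint(modified) == published)          # one changed byte is detected
```

A version control system gives you a comparable guarantee through its commit identifiers; the point is simply that the exact dataset version used for training can be named and verified.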