I have commented in numerous locations about the need to require open data before a system can be labeled Open Source AI.
I have coalesced them into one complete document with clear reasoning and the potential for people who disagree to prove me wrong.
This current draft definition is not valid and its approval, while good for the corporations, will bring a large drop in respect for OSI.
At a time when we are trying to defend Open Source, OSI seems intent on weakening it with this Draft.
Comments on the blog post, posted in a thread here, are greatly appreciated.
I would like to note that I have yet to see an OSI board member respond to any of the comments calling for open data.
So is this discussion board /dev/null or is the OSI really interested in discussing the topic?
Perhaps I have missed them. If so, please do me the kindness of pointing them out to me.
Thanks for sharing @thesteve0, I carefully read your arguments and I find them very reasonable.
I think the counterproofs you are asking for are too weak, though.
For example, in Principle of right to inspect, for certain models it’s quite easy to “demonstrate multiple biases in a model without having access to the data” and it has already been done several times in the past.
It’s a weak counterproof because being able to identify some bias is not equivalent to being able to identify all bias in the model. To be able to identify all bias in the model you need to have access to the whole dataset.
A better counter-proof would be to demonstrate that it’s always possible to identify all the bias in the model without access to the data used to create it.
Source Data: All the data used to train the system must be available for use by the public either under licenses that comply with the Open Source Definition or under the same legal terms that allowed their use to train the system in the first place, so that a skilled person can recreate an exact copy of the system using the same data.
Requiring data to “be available” instead of being “made available” removes legal uncertainty around distribution of data, and “by the public” implies “free and public access to the data”.
I don’t care much about the wording (and I’m neither a native English-speaker nor a lawyer), so I think we could and should improve that definition (“must”? “shall”?) and change whatever suits.
But do you think that such a definition would address your concerns? How would you improve it? Can you propose a different and better formulation?