Data is required - my arguments all in one concise place

Shamar · September 24, 2024, 8:16am

Thanks for sharing @thesteve0, I carefully read your arguments and I find them very reasonable.

I think the counterproofs you are asking for are too weak, though.

For example, in Principle of right to inspect, for certain models it’s quite easy to “demonstrate multiple biases in a model without having access to the data” and it has already been done several times in the past.
It’s a weak counterproof because being able to identify some biases is not equivalent to being able to identify all the biases in the model. To identify all of them, you need access to the whole dataset.
A better counterproof would be to demonstrate that it’s always possible to identify all the biases in a model without access to the data used to create it.

In Principle of right to modify, you say that “exact match may not be possible due to the randomness in the fitting process”, but any source of randomness can be recorded and reproduced.
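To make that point concrete, here is a minimal sketch in Python. It assumes that every random choice in the fitting process (initial weights, sample order) is drawn from a single seeded generator; `fit_line`, `DATA` and `RECORDED_SEED` are hypothetical names made up for this illustration, not part of any real training pipeline. Real systems may have further sources of nondeterminism (parallel hardware, for example) that would need to be pinned down in the same spirit.

```python
import random

def fit_line(data, seed, epochs=20, lr=0.01):
    """Toy stochastic fit of y = w*x + b; all randomness flows from `seed`."""
    rng = random.Random(seed)      # private generator, no hidden global state
    w = rng.uniform(-1.0, 1.0)     # random initialisation
    b = rng.uniform(-1.0, 1.0)
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)       # random visiting order in each epoch
        for x, y in samples:
            err = (w * x + b) - y
            w -= lr * err * x      # plain per-sample gradient step
            b -= lr * err
    return w, b

# Hypothetical training data and a seed recorded alongside the model.
DATA = [(float(x), 2.0 * x + 1.0) for x in range(10)]
RECORDED_SEED = 20240924

first = fit_line(DATA, RECORDED_SEED)
second = fit_line(DATA, RECORDED_SEED)      # replay with the recorded seed
other = fit_line(DATA, RECORDED_SEED + 1)   # a seed nobody recorded

print(first == second)   # same data + same seed: identical parameters
print(first == other)    # different seed: a different result
```

With the same data and the recorded seed, the second run repeats the first one step by step, so the comparison is an exact equality and not just "close enough". Without the data, recording the seed would not help.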

As for your argument about “field of use restriction”, I think it’s very interesting, and I’d like to reason with you about the definition I proposed before, which I think would address the issue.

In particular, for the source data it stated:

Source Data: All the data used to train the system must be available for use by the public either under licenses that comply with the Open Source Definition or under the same legal terms that allowed their use to train the system in the first place, so that a skilled person can recreate an exact copy of the system using the same data.

Requiring data to “be available” instead of being “made available” removes legal uncertainty around the distribution of data, and “by the public” implies “free and public access to the data”.

I don’t care much about the wording (and I’m neither a native English speaker nor a lawyer), so I think we could and should improve that definition (“must”? “shall”?) and change whatever suits.

But do you think that such a definition would address your concerns? How would you improve it? Can you propose a different and better formulation?