Training data access

The “open access” middle ground is a complicated trade-off. Requiring the data to be fully open source (e.g. under CC-BY-SA, like a Wikipedia dump) is not practical for a large portion of the models already in wide use across the AI community: it would probably make 99% of existing models non-compliant.

On the other hand, if we remove the “open access” requirement and allow model authors to hide the original training dataset, we lose the whole spirit and all the advantages of open source / free software, which would be a historical mistake.

I personally believe the solution lies somewhere near “open access”. Admittedly, it is unclear how that would apply in the case @fontana mentioned.

A direct example is ResNet-50 [1]. It is used by the whole computer vision community, and the pre-trained models are already widespread in various academic and commercial projects. But ImageNet cannot be accessed anonymously, and it is licensed for academic use only. People can download it only after applying on its website, or from Kaggle after signing the agreements.

Yet for a researcher, ResNet-50 trained on ImageNet is very easy to reproduce, inspect, study, etc. I don’t yet have a good idea of what kind of phrasing could properly account for these complications.

[1] A pre-trained ResNet can be found in the Torchvision 0.17 documentation, under “Models and pre-trained weights”.