Training data access

lumin · March 6, 2024, 1:48am

The “open access” middle ground is a complicated trade-off. Requiring the training data to be fully open (e.g. under CC-BY-SA, like a Wikipedia dump) is impractical for a large portion of the models in wide use across AI communities: it would probably make 99% of existing models non-compliant.

Conversely, if we remove the “open access” requirement and allow model authors to withhold the original training dataset, we lose the spirit and all the advantages of open source / free software, which would be a historical mistake.

I personally believe the solution lies somewhere near “open access”. But indeed, it is unclear how that would apply in the case @fontana mentioned.

A direct example is ResNet-50 [1]. The whole computer vision community uses it, and its pre-trained weights are already widespread in academic and commercial projects. But ImageNet, the dataset it was trained on, cannot be accessed anonymously and is licensed for academic use only. People can download it only after applying on the ImageNet website, or from Kaggle after signing the agreements.

Yet for a researcher, ResNet-50 trained on ImageNet is very easy to reproduce, inspect, and study. I don’t yet have a good idea of what phrasing would properly account for these complications.

[1] A pre-trained resnet can be found here: Models and pre-trained weights — Torchvision 0.17 documentation
