Training data access

I think @Aspie96 hits the issue very well

also great:

But I had more conversations on the data issue around the FOSDEM weekend, and I had a sudden revelation after one conversation with an ML creator. As we chatted, he clarified what he meant by saying “we need the data”: for his projects, he simply doesn’t have good data. Assembling a good dataset is tedious, time-consuming, expensive, and legally uncertain. Distributing a good dataset is even more complicated.

As I asked more questions, he clarified that he doesn’t need the original dataset to modify Llama2 or Whisper. He just wishes he had more large, high-quality datasets to train and fine-tune his own systems, built on top of other foundation models.

This specific conversation left me wondering if we’re asking the wrong question. Given the legal status of data, I can’t see how we can make the original dataset a dependency of a trained model and still have any meaningful definition of Open Source AI.

Maybe what AI creators and researchers are telling us when they ask for the data is that we should really have a way to force or incentivize companies to release their datasets, independently of the models?