The Missing Third Leg: Training Data Excluded from Open Source AI Definition by [Co-]Design

The case of the Whisper large-v2 model is exactly what I worried about when I proposed the ToxicCandy term in ML-Policy. Speaking from the academic side, scientific progress usually works like this: you reproduce a piece of work from others and check whether the outcome matches expectations (in the Whisper case, at least that the performance metrics match what is reported in the papers/tech reports). Then you make whatever modification you want and check whether your own version of large-v2 is better.

Controlled experiments are important. If you use the same training code plus your own modification, but different data, and obtain a performance improvement over large-v2, then you cannot claim that your modification is what produced the improvement. It may simply come from the data.
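
To make the point concrete, here is a minimal sketch of the ablation I mean. Every name in it (`train`, `evaluate_wer`, the dataset strings) is a hypothetical placeholder, not part of any real Whisper training pipeline:

```python
# Sketch of a controlled experiment: to attribute an improvement to a code
# modification, the training data must be held fixed. All functions below are
# hypothetical stand-ins.

def train(code_modification: bool, dataset: str) -> str:
    """Stand-in for a full training run; returns a model identifier."""
    return f"model(mod={code_modification}, data={dataset})"

def evaluate_wer(model: str) -> float:
    """Stand-in for evaluation on a fixed benchmark (lower WER is better)."""
    ...

# Valid ablation: the ONLY variable is the code modification, so any WER
# difference between the two runs is attributable to the modification.
baseline  = train(code_modification=False, dataset="original_training_data")
candidate = train(code_modification=True,  dataset="original_training_data")

# Confounded comparison: two variables change at once (code AND data), so an
# improvement over large-v2 cannot be attributed to the modification alone.
confounded = train(code_modification=True, dataset="my_own_data")
```

Without the original training data being released, only the confounded comparison is possible; the valid ablation is out of reach for everyone outside the company.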

Without the training data, I see no freedom to reproduce the original work and no chance to improve upon it. What I see is a nominally "open-source" model with a dataset barrier protecting a monopoly, where the surrounding community can only fork the model while remaining fully controlled by capital. Once the company behind it goes down, nobody else in the world will be able to reproduce the work in order to keep maintaining the model or the AI software built on it. This is not what a free software community looks like.

I respect OSI's hard work on converging toward something practical. But I still hold to my personal opinion that the current state of the definition is not a good idea in the long run.
