Open Source AI needs to require data to be viable

juliaferraioli · June 6, 2024, 5:56pm

@zack and @shujisado on field-of-use restrictions – individual components may be covered by their individual licenses (depending on what conformant/compliant wind up meaning), but the overall system may be subject to additional terms, which is why we need this to be explicit. See the explanation in the parent post.

Nobody is rebuilding from scratch OLMo or Pythia just because they want to replicate the build before shipping it to their users (like Debian does for its software packages.) It makes no sense to do so: retraining a system is not going to generate an identical system anyway, and it’s guaranteed to cost money and time without even generating academic points.

I vehemently disagree with your evaluation here, @stefano. Nobody may be doing it right now, but for broader adoption it is critical to have the capability to do so. Just because you don’t see a use here doesn’t mean there isn’t one. We’re talking about statistical models. Even if the retrained system is not 100% identical to the original, it should produce the same results from a statistical standpoint given the same data and parameters.
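To make the "same results from a statistical standpoint" point concrete, here is a minimal sketch. It is a toy logistic regression on synthetic data, not OLMo or Pythia, and every name and parameter in it (`make_data`, `train`, `agreement`, the seeds, the learning rate) is an assumption chosen for illustration. It trains twice on the same data with different random seeds, then checks that the learned weights differ while the predictions largely agree.

```python
import math
import random


def make_data(n, seed):
    """Toy binary-classification data: the label depends on x plus Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 1 if 2.0 * x + 0.5 + rng.gauss(0.0, 0.3) > 0 else 0
        data.append((x, y))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(data, seed, epochs=50, lr=0.05):
    """Logistic regression via SGD; the seed controls the init and the shuffle order."""
    rng = random.Random(seed)
    w, b = rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)
    order = list(data)
    for _ in range(epochs):
        rng.shuffle(order)
        for x, y in order:
            err = sigmoid(w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b


def predict(model, x):
    w, b = model
    return 1 if w * x + b > 0 else 0


def agreement(model_a, model_b, xs):
    """Fraction of inputs on which the two models predict the same label."""
    same = sum(predict(model_a, x) == predict(model_b, x) for x in xs)
    return same / len(xs)


# Same training data, two independent training runs (hypothetical seeds).
train_set = make_data(500, seed=0)
grid = [i / 100.0 - 1.0 for i in range(201)]  # evaluation points in [-1, 1]
run_a = train(train_set, seed=1)
run_b = train(train_set, seed=2)
print("weights differ:", run_a != run_b)
print("prediction agreement:", agreement(run_a, run_b, grid))
```

The comparison is only possible because both runs start from the same `train_set`. Without access to the data, there is nothing to retrain on and no way to run this kind of check.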

Is the draft Open Source AI Definition working as intended?

It still is not clear how it is intended to work. I hear conflicting information all over this thread. Until the criteria are concrete, exercises to evaluate systems are subject to extensive subjective interpretation, meaning you cannot normalize their results against each other.

Please correct me if I’m missing some overarching point here, but the counterargument against requiring data seems to be “it’s too hard”?

I echo @spotaws’s comment:

If the subtext here is that “there are not enough good open datasets for training AI systems”, then let’s invest time and money in that instead of lowering the bar, which will have a chilling effect broader than AI.

Great callout, @nick, on the license change for Dolma. Given that, I do think that OLMo would qualify as open source based on the criteria that I posit are necessary. Pythia still would not, as its data are not licensed.