Open Source AI needs to require data to be viable

Stefano, I think the goal of OSI as you sketch it here is eminently reasonable and laudable. For legislative and regulatory purposes we probably need that kind of clarity, and I will note again that I’m in agreement with the OP: the most viable version is one that requires data.

One aim of empirical efforts like ours is to make visible how much work remains before model providers can clear that kind of bar. I don’t see how calling out open-washing, and making visible the precise openness dimensions that open-washers try to obfuscate, is “playing into their hands”. I hope we agree that Llama 3, Mistral and the like, with their strategic lack of clarity about training and fine-tuning data, definitely don’t qualify as “open source” and are “at best open weights”, as we write in our paper. Sunlight is the best disinfectant. But then again, I haven’t been at this for 26 years yet, and may be too optimistic. :smiley:

> It’d be great to have one more column in your paper: after showing the availability of components, a final one that puts a checkmark indicating whether that system passes or fails the “Open Source AI Definition”.

All our data is available, and the kind of additional column you mention would be fairly easy to add. It would be like method 4 in our Figure, where only a small slice of models survives a dichotomous open/not-open decision.
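
For concreteness, here is a minimal sketch of how such a pass/fail column could be derived from per-model openness ratings. Everything in it is an assumption for illustration: the file name, the column names, and the set of dimensions the definition would require are placeholders, not our paper’s actual schema or the OSI’s final criteria.

```python
import csv

# Hypothetical set of dimensions that must all be fully open for a
# model to clear the Open Source AI Definition bar (an assumption,
# not the OSI's official criteria).
REQUIRED = ["training_data", "model_weights", "source_code", "license"]

def passes_osaid(row: dict) -> bool:
    """Dichotomous test: True only if every required dimension is
    rated 'open'; partial openness counts as a fail."""
    return all(row.get(dim) == "open" for dim in REQUIRED)

# 'openness_ratings.csv' is a hypothetical export of per-model ratings:
# one row per model, a 'model' column plus one column per dimension.
with open("openness_ratings.csv", newline="") as f:
    for row in csv.DictReader(f):
        verdict = "Open Source AI" if passes_osaid(row) else "NOT Open Source AI"
        print(f"{row['model']}: {verdict}")
```

The all-or-nothing check is exactly what makes the surviving slice so small: a model rated “open” on most dimensions but “partial” on even one of them ends up in the NOT Open Source AI pile.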

> there is no “oh but this is slightly better, more open because …” Nope, it goes in the same pile over there: NOT Open Source AI

I can definitely see why this makes sense from the OSI point of view. Our interests are broadly aligned but don’t fully overlap: for academic research and educational purposes, very open models are useful and important to keep track of even if they carry restrictive licensing, which is another reason our index exists.
