Open Source AI needs to require data to be viable

Yes, fair point on that, @moodler and @stefano. I hadn’t seen Downing’s piece; it is very useful indeed. I fear I must leave some of my idealistic notions behind in this context.

I am okay with BloomZ not counting as OSI-open, on account of its release under RAIL rather than an OSI-approved license. I think my main concern, in line with the OP of this thread, is that meaningful forms of openness relate more to data availability and to the disclosure of training and source code than to the licensing of a single derivative component.

For example, Mistral can merrily release their Mistral 7B or Mixtral model weights under Apache 2.0 (and benefit from open-washing), but if none of the upstream data and code is available, that is not a very meaningful form of openness: it doesn’t allow users to exercise several of the fundamental freedoms (1 and 3). Model weights are the most inscrutable component of current generative AI, and providers that release only those should not get a free ‘openness’ pass. (In this I am in agreement with @juliaferraioli’s OP.)

In contrast, if a player like BigScience shares source code, data collection, curation, finetuning, and model architecture, releasing most of this under Apache 2.0, then anyone with the compute (yes, I know this will be very few parties) can actually train their own model and exercise all the freedoms. BigScience’s choice to release its own model weights under RAIL primarily affects freedom 0, but not necessarily the others. I recognize that this, too, may not be sufficient to qualify for OSI-openness. (The idealist in me would like to count it as more open, though, if only to incentivise model providers to move towards meaningful openness and away from open-washing.)