Open Source AI needs to require data to be viable

juliaferraioli · June 3, 2024, 5:48pm

I wanted to let this discussion evolve before jumping back in and I’m really happy to see such nuanced perspectives.

Blockquote
The law treats data differently than code

Yes, it does. And that’s okay. We aren’t constructing a legal definition here. We aren’t creating new licenses. We’re creating a definition of what it means, in spirit, to be open source AI. Looking at it from the legal perspective and working backwards from that is not productive. It is artificially limiting, and creates a brittle definition as the law is variable country by country and evolves over time.

He goes on to say that

Blockquote
The text of draft 0.0.8 is drafted to be vague on purpose to resist the test of time and technology changes.

Resisting time and technology changes is good, but a draft as vague as the current one is nearly meaningless. It relies too much on interpretation and judgement calls. My definition of a skilled person is different than another’s. My understanding of a substantially equivalent system may be wildly different than someone else’s. The same goes for same or similar data.

If same or similar data is not freely available and accessible to an arbitrary party, does it fail the test? Or is this an intended loophole?

Examples

@stefano, you requested that I create this thread with this text. I am providing clear feedback on the definition as drafted. Clear examples are difficult, given the intentionally vague language in the draft.

However, I think that Spot’s example of the Reddit data set is a good one. You can give all the details of the Reddit data set that your AI system was trained upon, but if everyone has to fork over $$$ to license the data set in order to modify it, then it’s not really open source, is it?

@Danish_Contractor posed two scenarios

Blockquote
AI system number 1 shares that it was:
“Trained on English language data from Wikipedia” and it releases weights under OSI compliant license.

Blockquote
AI system number 2 says:
"Here are URLs of articles from Wikipedia we trained on" and releases weights under OSI compliant license.

Both scenarios are missing important information, like the timestamp at which the data were fetched. Those timestamps, and therefore the versioned data, are critical to being able to train the system. Without the versioned data, I wouldn’t consider it open. There’s a reason that when citing a webpage in a paper, you include the “retrieved at” date.
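As a rough illustration of what "versioned data" could mean in practice, here is a minimal Python sketch of a manifest entry that pairs each URL with a retrieval timestamp and a content hash. The function names (`manifest_entry`, `matches`) and the record fields are assumptions made up for this example; nothing like this is specified in the draft definition or used by either hypothetical system above.

```python
import hashlib
from datetime import datetime, timezone


def manifest_entry(url: str, content: bytes, retrieved_at: datetime) -> dict:
    """Describe one fetched document so others can tell which version was used.

    Hypothetical record layout: URL, UTC "retrieved at" time, SHA-256 of the bytes.
    """
    if retrieved_at.tzinfo is None:
        # A naive timestamp is ambiguous, so refuse it rather than guess a zone.
        raise ValueError("retrieved_at must be timezone-aware")
    return {
        "url": url,
        "retrieved_at": retrieved_at.astimezone(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),
    }


def matches(entry: dict, content: bytes) -> bool:
    """Check whether freshly fetched bytes are the same version the entry records."""
    return hashlib.sha256(content).hexdigest() == entry["sha256"]
```

A URL list alone cannot support the `matches` check: a page fetched today may differ from the page that was trained on, and without the timestamp and hash there is no way to tell.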

@moodler referenced the Whisper model

Blockquote
https://github.com/openai/whisper/tree/main/data

Blockquote
It might be a nice example to discuss … is it sufficient or not to qualify this model as Open Source AI?

I think this is a good step. I haven’t gone through the work of reproducing it, but assuming it is complete, then yes, I believe it would (almost) qualify. Again, I would want “retrieved at” dates. It isn’t as good as including the raw data directly, but that is a high bar.

Levels of openness

From an ideological standpoint, I like the levels of openness concept. From a practical standpoint, I think it would be too complex to administer, understand, and consume. In this, I agree with @stefano.

At times like these, I do wonder if we are over-indexing on a definition. If we abstract away the ins and outs, we can consider this like complex software with multiple dependencies. If one dependency/aspect of that software is not open source, then the resulting piece of complex software is not open source.

Like-to-like is always easier to evaluate, but at the end of the day we’re dealing with the same issues. Do all (present and necessary) components need to be open to be open (source)? Yes.

If they aren’t, then it’s not open source.
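The dependency analogy can be written down as a small Python sketch. The `Component` type and its fields are hypothetical names chosen for this illustration, not terms from the draft definition; the only rule encoded is the one stated above, that every present and necessary component must be open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    # Hypothetical fields for illustration only.
    name: str
    necessary: bool  # is this component needed to use, study, or modify the system?
    open: bool       # is it available under open terms?


def is_open_source(components: list[Component]) -> bool:
    """A system counts as open only if every necessary component is open."""
    return all(c.open for c in components if c.necessary)
```

Under this rule, a system with open code and open weights but closed, necessary training data evaluates to not open, which is the same outcome as a software project with one closed dependency.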

Field of use restrictions / restrictions

It is my opinion that the definition (if there is one definition) needs to include a prohibition against field-of-use restrictions, as the OSD does. It currently does not.