Open Source AI needs to require data to be viable

This is a very important topic, as its recurring nature confirms. Thanks to @juliaferraioli and the AWS Open Source team for bringing it up.

To limit the risk of rehashing arguments already discussed, and to help the discussion flow towards a conclusion, I'd like to highlight a few things to keep in mind as you contribute to this thread.

1. Looking for the source of AI leads to confusion

Specifically, the ML systems we're focusing on have a very different structure than the source–binary pair we're used to. I highly recommend abandoning the mental map that makes you look for the source of AI (or ML), as that map has been driving us in circles. We have agreed instead that we're looking for the “preferred form to make modifications to the system”, as in “What components of ML systems are required to change the output given an input?”

2. The law treats data differently than code

It’s pretty clear that the law in most jurisdictions around the world makes it illegal to distribute certain data, because of copyright, privacy and other laws. It’s less clear how the law treats datasets, i.e. the difference between distributing the Harry Potter books in epub format and a tokenized version of the saga (unreadable by humans). Laws around the world are still being written and court cases in the US are still developing: we won’t know for sure for a long time.

A related question, what the law should say about the tokenized versions of the data (the datasets), is such a crucial topic that OSI is financing a workshop to start framing this issue more clearly. If you’re interested in this topic, a good primer is Open Future’s paper Towards a Books data commons. Eventually, we will have to form an opinion and educate regulators. This will take time and can go on a separate thread.

3. Re-read the definition of Data information in the draft 0.0.8

The text of draft 0.0.8 is vague on purpose, so it can resist the test of time and changes in technology. But it’s not supposed to be vague about the intention: to allow AI/ML practitioners to retrain an Open Source AI model and improve it. Draft 0.0.8 says:

Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.

This wording follows the workgroups’ analysis. It’s also based on the fact that even the most open ML systems (like Pythia) cannot distribute the datasets they used for training, and most likely there will not be legal clarity for many years. Maybe there will never be clarity once global legislation is factored in (for example, distributing The Pile may be legal in Japan but not in the US).

4. Provide clear examples

When you criticize the draft, provide specific examples in your question and avoid arguing in the abstract. For example: what if the data information discloses that the model has been trained with Common Corpus, StarCoder and the Reddit data? Would that requirement be considered satisfied? Another example: what if the data information discloses the use of StarCoder v1, but a court order forces StarCoder to remove v1 from public access so that only v2 can be used? Will the model that was Open Source AI lose that qualification?

5. Join the townhall meetings

This topic is hot and we need to discuss it more. Join the townhalls to ask for clarifications. The next one is on May 31st: https://opensource.org/events/open-source-ai-definition-town-hall-may-31