On the current definition of Open Source AI and the state of the data commons

I would like to share an article by Nathan Lambert (ML Scientist at AI2), reviewed by Percy Liang (Co-founder at Together.ai, Director CRFM/HAI at Stanford), Stella Biderman (AI Researcher at EleutherAI), Aviya Skowron (Head of Policy and Ethics at EleutherAI), and Yacine Jernite (ML and Society Lead at Hugging Face).

I believe this offers a very balanced view from top leaders in the “Open Source AI” space and the struggles they face with regard to the training data.

Percy Liang’s comments on X are also worth mentioning:

This is a very good discussion that touches on issues related to personal data. The analysis from a copyright perspective by Senficon is also well worth reading.

In Japan, where I live, using copyrighted works for AI training without permission is already established legal practice, and it will likely become explicitly legal in many other jurisdictions in the future. However, even if data is used legally for AI training, distributing it for purposes other than AI training should not be permitted. That is the nature of the copyright system, and such data is obviously not open data. Still, if we can give other developers a way to access the data used for training, those developers should be able to create equivalent AI models using the exact same methods. We must avoid a situation where we cannot call something Open Source simply because we used data that could be legally utilized without anyone’s permission.

Since I don’t trust X/Twitter, it’s worth copying the content of that thread here for archival/search purposes:

Taken from xcancel, by Percy Liang:

Thoughts on OSI’s draft open-source AI definition v0.0.9: Key question: how much information about the data needs to be disclosed to get the open-source stamp? My initial reaction was that the entire dataset must be released, but now I think this is neither sufficient nor necessary.

LM pretraining data has copyright, privacy, and consent issues. Even beloved datasets such as CommonCrawl that researchers use all the time are not in the clear, so forcing model developers to release questionable data doesn’t seem right.

(And if you don’t include questionable data, you don’t get a good model.)

On the flip side, even a full data release is insufficient for the goals of open-source (to study and modify). From just the data, it is hard to understand why certain tokens are included or excluded. For this, you really need the code for the full data processing pipeline.

The current OSI definition says “Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data”.

I think this is too vague and should be sharpened to exclude high-level descriptions (e.g., the LLaMA paper) and include code (e.g., RedPajama, Dolma, FineWeb).

To analogize models with software: Think of the entire procedure of creating a model (data processing, filtering, tokenization, training, etc.) as the code, and the generated assets (training data, model weights) as the binaries created through execution/compilation.

Having the entire procedure allows one to understand the model holistically and to modify the system in a much deeper way (given compute).

The nice thing about this view is that everything is just code at the end of the day, and we have decades of experience with open-source software to lean on.

An open-source definition proposal: the following should be released:

  • Code for entire procedure
  • Any part of the executed versions (data, weights) without copyright/privacy concerns
  • Pointers to all the raw data (CommonCrawl, torrent link, etc.)

See @natolambert’s recent blog post (interconnects.ai/p/defining-…) for some related thoughts, and thanks to @BlancheMinerva @aviskowron @YJernite @natolambert @smaffulli for helpful conversations.

For context: Eleuther is currently a direct co-defendant, or next in line, in a number of ongoing matters. It’s hard to fully separate some of these statements from their ongoing legal issues…

If we properly treat the source dataset and the entire procedure of creating a model (data processing, filtering, tokenization, training, etc.) as the code, including the values provided by random sources during the whole process, so that the exact executables (the weights) can be compiled again from the sources, the definition becomes crystal clear:

Data information: Sufficiently detailed information and all the data used to train the system (including any random value used during the process), so that a skilled person can recreate an exact copy of the system using the same data. Data information shall be made available with licenses that comply with the Open Source Definition.
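To make the “including any random value” requirement concrete, here is a minimal, hypothetical sketch (not taken from the thread; the function name, seed value, and manifest file name are assumptions for illustration) of how a training run could pin every source of randomness and publish it alongside the data manifest:

```python
# Hypothetical sketch: fix and publish every seed used during training so that
# the exact weights can, in principle, be "compiled" again from the sources.
import json
import random

import numpy as np
import torch


def set_reproducible_seeds(seed: int) -> None:
    """Seed every source of randomness the training run relies on."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for determinism; some CUDA ops also need
    # CUBLAS_WORKSPACE_CONFIG to be set in the environment.
    torch.use_deterministic_algorithms(True)


if __name__ == "__main__":
    run_config = {"seed": 20240901, "dataset_manifest": "training-data.sha512.json"}
    set_reproducible_seeds(run_config["seed"])
    # Publishing run_config together with the code and data manifest is what
    # lets a skilled person attempt an exact rebuild of the system.
    print(json.dumps(run_config, indent=2))
```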

And if you don’t include questionable data, you don’t get a good model.

First, this is not true in general: most hyped models are indeed built on top of questionable data, but several AI systems are built from data that is not human-related (such as weather or pollution data, or data from engineering fields) or not questionable at all (such as aggregated traffic data over streets).

Moreover, that is completely orthogonal to the OpenSource-ness of an artifact: for example, there are plenty of low-quality or bug-ridden (and even backdoored) JS packages on NPM (do you remember colors.js?) that were open source anyway.

On the other hand, a proper OpenSource AI definition that really grants the right to study and modify the system might lead to better open and unquestionable datasets in those fields where they are lacking.

To be honest, that article includes several passages that look like logical non-sequiturs.

A functional definition of open source AI cannot require parties to commit potentially illegal acts with data, but the system still needs to be easy to build upon.

This is the deepest and biggest non-sequitur: nobody would force anyone to commit potentially illegal acts with data by providing a definition of OpenSource AI; at worst, the developers can choose not to label their system as “OpenSource AI” but as freeware or whatever.

In fact, no definition can force anyone to use questionable data.

  1. Content presumed under standard copyright is not technically open data (according to the European Union and other institutions).
  2. Personal data should be handled differently than source code (and not readily redistributed).

There are plenty of huge datasets that can be used to train useful AI systems without posing any such legal issues (weather datasets, all sorts of industrial machine-related and software-related logs, and so on…).

As an example, only some of Ai2’s recent open dataset Dolma is totally clear of these questions. Additionally, if there is personal data in a dataset that is publicly uploaded, what happens to the nature of the “open source” dataset if personal data is later removed to comply with GDPR?

Simple: you update the model and release a new major version. :wink:

The old versions would no longer qualify as OpenSource AI, since the whole dataset used to train the system is no longer available, but the new version would.
And the removal of the personal data rows would not impact any legally binding license on the old version.

And if you don’t build a new version out of the patched data (for any reason), others will, exactly as routinely happens whenever permissively licensed open source software gets re-released under a proprietary or source-available license out of corporate greed.

We didn’t change the Open Source Definition to accommodate the Redis license change. In the same way, we shouldn’t set an OpenSource AI definition that would, in fact, be used to open-wash anything.

Works created by U.S. federal government employees are generally in the public domain. There is no problem using these works for AI training within the United States. However, these works are not in the public domain in other countries. This means that while it is legal to use documents created by the U.S. government for AI training in the U.S., it may not be legal in other countries.

In my country, Japan, it would probably be permissible to use U.S. government works for AI training under Article 30-4 of the Copyright Act. However, making a dataset composed of U.S. government documents publicly available in Japan could lead to complex issues involving privacy, moral rights, and other considerations.

This may be a very narrow issue, and in practice, it might not pose any real obstacles, but would it still be considered “open-washing” if such problems led to keeping the dataset non-public?

It’s open washing if you call a system “open source” without providing everything that is needed to study and modify it, whatever the reason.

In your hypothetical case, if the developers provide enough information about how to retrieve and process the data “so that a skilled person can recreate an exact copy of the system using the same data”, they can legitimately call their system “Open Source AI”.

If, instead, they decide to hide a portion of the data or any information needed “so that a skilled person can recreate an exact copy of the system using the same data”, any attempt to pass the system off as “Open Source AI” would be open washing.

Then, what if the dataset of U.S. government documents, which could cause legal issues if it contained actual data, is converted to a dataset that only includes the URLs and is then made publicly available? This would essentially have the same content at that point in time, but would it still be considered open-washing? Please note that the documents at those URLs may change or be deleted over time.

Well, that system would qualify as OpenSource AI as long as all the documents pointed to by the URLs stay available and unmodified.

However, to comply with the definition, such a dataset should also include a UTC timestamp of the retrieval of each document actually used in training and a cryptographic hash of each document’s content, as these are needed to check the dataset’s identity and integrity.

This way, users of the system could actually identify the inapplicability of the OSAI definition without wasting GPU cycles, cooling water, and energy.
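As a concrete illustration, here is a minimal, hypothetical sketch (the manifest file name and its fields are assumptions, not something prescribed in this thread) of how such a URL-only dataset, with retrieval timestamps and SHA-512 hashes, could be verified before anyone spends compute on rebuilding the model:

```python
# Hypothetical sketch: verify a URL-only dataset manifest whose entries carry
# the document URL, the UTC timestamp of the original retrieval, and the
# SHA-512 hash of the content that was actually used for training.
import hashlib
import json
import urllib.request


def verify_manifest(path: str) -> bool:
    """Return True only if every referenced document is still available and unmodified."""
    with open(path) as f:
        entries = json.load(f)  # [{"url": ..., "retrieved_utc": ..., "sha512": ...}, ...]

    ok = True
    for entry in entries:
        try:
            with urllib.request.urlopen(entry["url"]) as resp:
                digest = hashlib.sha512(resp.read()).hexdigest()
        except OSError:
            print(f"MISSING   {entry['url']}")
            ok = False
            continue
        if digest != entry["sha512"]:
            print(f"MODIFIED  {entry['url']} (originally retrieved {entry['retrieved_utc']})")
            ok = False
    return ok


if __name__ == "__main__":
    # If this check fails, the system no longer matches a definition that
    # requires the exact training data to remain available.
    print("manifest verified" if verify_manifest("manifest.json") else "manifest broken")
```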

In any case, the moment the original training dataset becomes unavailable (in whole or in part), any system built from it would no longer match the Open Source AI definition.

Thus I’d argue that it would be a shady approach that leads towards open washing. Yet, in some rare situations it might be an ethically reasonable approach.

Indeed, as long as the data stay available online and anybody can detect any corruption / removal / modification with a simple curl | sha512sum, the system could qualify as Open Source AI.

I roughly understand your point of view. Thank you for the explanation.
Applying your perspective to OSAID, it seems very difficult for large language models to be recognized as Open Source AI.

Why?

It’s perfectly possible to build LLMs that match such an Open Source AI definition.

So much so that it has already been done in Italy at La Sapienza University, where a group of researchers trained a foundational model called Minerva with 5 billion tokens from open-access texts only.

If an underfunded university did it already, we know it’s not difficult at all.

And in fact I guess there are several other LLMs out there that would match a proper OSAI definition, but they are simply obscured by the hype that surrounds opaque and uninspectable commercial LLMs.
We shouldn’t strive to adopt an OSAI definition that lets such opaque and closed systems pass as “open source”; we should strive to let existing and novel systems that really follow the values and provide the freedoms of open source shine!

There is really no reason to adopt a misguided Open Washing AI definition instead of a coherent Open Source AI one.

The site for the Minerva series LLMs states that they use datasets such as CulturaX, OSCAR-2201, OSCAR-2301, and mC4. These datasets are extracted from Common Crawl, and the OSCAR page specifically includes a disclaimer: “Being constructed from Common Crawl, personal and sensitive information might be present.” This means that if any problems arise due to sensitive information, the data may not remain publicly available.

Additionally, while it mentions that the training data for Minerva LLMs is sampled from datasets like OSCAR, I could not find where the actual data is located.

This is the last reply to this thread. The thread is a bit long.

Sorry, but I can’t follow your argument.

If the data sources are available and legally usable to train the LLM, a simple link with versioning and a sha512sum would be enough.

We are in exactly the same scenario you proposed above, and the same solution applies.

The point remains: it’s perfectly possible to create an LLM that complies with an Open Source AI definition requiring the availability of the data used to train the model’s weights, “so that a skilled person can recreate an exact copy of the system using the same data”.

Well, it’s not the longest thread we have joined so far, but it has been a deep exchange on the supposed limits of a proper Open Source AI definition, one that proved its applicability to interesting corner cases.

Anyway, thanks for the conversation, and if you have further doubts or suggestions, I’d be happy to discuss them!

I have to agree with @shujisado that this thread is getting quite long and repetitive with no new arguments, so I’m closing the topic.
