Explaining the concept of Data information

Originally published at: https://opensource.org/blog/explaining-the-concept-of-data-information

This post clarifies how the draft Open Source AI Definition arrived at its current state, the design principles behind the Data information concept and the constraints (legal and technical) it operates under.

1 Like

I have really enjoyed following the discussion so far, thank you @stefano for summarizing the case for data information.

I would like to make a few points about the implications of copyright law for the application of open source principles to the subject of AI, especially for the question of training data access. They largely lead me to the conclusion that data information is a viable concept for the purposes of the OSAID.

The definition of open source software has a legal element and an access element – the access element being the availability of the source code and the legal element being a license rooted in the copyright-protection given to software. The underlying assumption is that the entity making software available as open source is the rights holder in the software and is therefore entitled to make the source code available without infringing the copyright of a third party, and to license it for re-use. To the extent that third-party copyright-protected material is incorporated into the open source software, it must itself be released under a compatible open-source license that also allows the redistribution.

When it comes to AI, the situation is fundamentally different: The assumption that an open source AI model will only be trained on copyright-protected material that the developer is entitled to redistribute does not hold. Different copyright regimes around the world, including the EU, Japan, and Singapore, have statutory exceptions that explicitly allow text & data mining for the purposes of AI training (I will leave aside the discussion of fair use in the US and several other jurisdictions here, as it is a controversial subject in many online discussions and it isn’t central to my argument). The EU text & data mining exceptions, which I know best, were introduced with the objective of facilitating the development of AI and other automated analytical techniques. However, they only allow the reproduction of copyright-protected works (aka copying), but not the making available of those works (aka posting them on the Internet).

That means that an open source AI definition that would require the republication of the complete dataset in order for an AI model to qualify as open source would categorically exclude open source AI models from the ability to rely on the text & data mining exceptions in copyright – that is despite the fact that the legislator explicitly decided that under certain circumstances (for example allowing rights holders to declare a machine-readable opt-out from training outside of the context of scientific research) the use of copyright-protected material for the purposes of training AI models should be legal. This result would be particularly counterproductive because it would even render open source AI models illegal in situations where the reproducibility of the dataset would be complete by the standards discussed here on the forum.

To illustrate: Imagine an AI model that was trained on publicly accessible text on the Internet that was version-controlled, for which the right holder had not declared an opt-out, but which the right holder had also not put under a permissive license (all rights reserved). Using this text as training data for an AI model would be legal under copyright law, but re-publishing the training dataset would be illegal. Publishing information about the training dataset that included the version of the data that was used, when and how it was retrieved from which website and how it was tokenized would meet the requirements of the OSAID v 0.0.8 if (and only if) it put a skilled person in the position to build their own dataset to recreate an equivalent system. Neither the developer of the original open source AI model nor the skilled person recreating it would violate copyright law in the process, unlike the scenario that required publication of the dataset. Including a requirement in the OSAID to publish the data, in which the AI developer typically does not hold the copyright, would have little added benefit but would drastically reduce the material that could be used for training, despite the existence of explicit legal permissions to use that content for AI training. I don’t think that would be wise.
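
To make that concrete, the kind of data information meant here might look roughly like the following sketch. This is purely hypothetical; the field names are invented for illustration and are not taken from the OSAID draft.

```python
# Hypothetical sketch of "data information" for one source in a training set:
# enough detail for a skilled person to rebuild an equivalent dataset, without
# the copyright-protected text itself being republished.
dataset_information = {
    "source": "https://example.org/public-articles",   # placeholder URL
    "retrieved": "2024-03-15, via HTTP crawl",
    "version": "site revision r1234 (the source site is version-controlled)",
    "opt_out_check": "no machine-readable TDM opt-out declared at retrieval time",
    "filtering": "English-language pages only, boilerplate stripped, deduplicated",
    "tokenization": "BPE tokenizer, 32k vocabulary; tokenizer files published with the model",
}
```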

While I support the creation of public domain datasets that can be republished without restrictions, I would like to caution against pointing to these efforts as a solution to the problem of copyright in training datasets. Public domain status is not harmonized internationally – what is in the public domain in one jurisdiction is routinely protected by copyright in other parts of the world. For example, in US discourse it is often assumed that works generated by US government employees are in the public domain. They are not; they are only in the public domain in the US, while they are copyright-protected in other jurisdictions. The same goes for works in which copyright has expired: Although the Berne Convention allows signatory countries to limit the copyright term on works until protection in the work’s country of origin has expired, exceptions to this rule are permitted. For example, although the first incarnation of Mickey Mouse has recently entered the public domain in the US, it is still protected by copyright in Germany due to an obscure bilateral copyright treaty between the US and Germany from 1892. Copyright protection is not conditional on registration of a work, and there is no even remotely comprehensive, reliable information about the copyright status of works. Good luck to an open source AI developer who tries to stay on top of all of these legal pitfalls.

Bottom line: There are solid legal permissions for using copyright-protected works for AI training (reproductions). There are no equivalent legal permissions for incorporating copyright-protected works into publishable datasets (making available). What an open source AI developer thinks is in the public domain and therefore publishable in an open dataset regularly turns out to be copyright-protected after all, at least in some jurisdictions. Unlike reproductions, which only need to follow the copyright law of the country in which the reproduction takes place, making content available online needs to be legal in all jurisdictions from which the content can be accessed. If the OSAID required the publication of the dataset, this would routinely lead to situations where open source AI models could not be made accessible across national borders, thus impeding their collaborative improvement, one of the great strengths of open source. I doubt that with such a restrictive definition, open source AI would gain any practical significance. Tragically, the text & data mining exceptions that were designed to facilitate research collaboration and innovation across borders would then only support proprietary AI models while excluding open source AI. The concept of data information will help us avoid that pitfall while staying true to open source principles.

9 Likes

As a developer and the founder of a social venture, I am less interested in restrictive definitions than in creating practical and usable tools that can foster and popularize the propagation of open data. Creative Commons is a great example of how to go about this.

I suspect we are still operating under the assumption that users have no interest in sharing their data or opinions. The success of social media proves otherwise!

Even in sensitive areas such as healthcare, end users can and regularly do waive their rights to data privacy by participating in commercial support forums, from Facebook to Inspire.com or PatientsLikeMe. But they do not have access to the data repositories they are contributing to, even when the value of that data is part of a for-profit company’s business model.

What would be incredibly beneficial to researchers, grassroots organizations, and journalists everywhere, would be a simple, standard license to attach to a survey ensuring that respondents know how their data will be used and stored, and giving them the option to include their contact information (or not!) for follow-up interviews and questions.

This type of legal infrastructure could pave the way for creating new open datasets, whether under public domain or another license. It also happens to be something that my company needs to fulfill our original mission, and we may be able to assist with the cost of creating such a standard.

Please feel free to reach out if there is interest.

I read one of the previous threads.

I still think there’s a fairly deep divide about what the preferred form of modification is. Unfortunately, even more than with traditional software, it depends on what you want to do.
Some contributions to the discussion talked about fine tuning, transfer learning, and other techniques where the model weights actually are more of a preferred form of modification than the data.

As a software developer, retraining something like Mistral 7b is entirely outside of my budget or current capability.
If I want to exercise my freedom to use, to modify, or to share a LLM, the model weights are the form I need and want.
Modifying a model starting from its weights, through techniques like fine-tuning, is how I would go about modifying a model.
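
For concreteness, here is a minimal sketch of what modifying from the weights often looks like in practice, using parameter-efficient fine-tuning. The model name and adapter settings are placeholders, not anyone's actual pipeline.

```python
# Minimal sketch (placeholders, not a specific project's pipeline) of modifying
# a model starting from its released weights via parameter-efficient fine-tuning.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"        # the released weights are the form I actually receive
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Attach small trainable adapters; the base weights themselves stay frozen.
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()        # typically well under 1% of all parameters

# ... train the adapters on my own data with an ordinary training loop ...
```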

On the other hand, if I want to exercise my freedom to understand a model, then I start to care about the training data. In many ways data information is going to help me more with that understanding than the raw data. I am probably going to gain more insight from a description of the data than from the raw data itself.

I’d like to explore the possibility that a lot of this disagreement boils down to how we rank each of the freedoms.
For myself, I rank the freedom to use and modify most highly. I care most that there be high-quality LLMs with open components available without use restrictions. So for me, I would prefer to relax the requirements on data even beyond what is in draft 0.0.8. My fear is that if there are not high-quality AI models that meet the definition, it will not have enough initial relevance to have market impact.

On the other hand, if you are a researcher focused on understanding as the primary freedom you care about, then getting as strong a requirement for openness of data as you can might be important.

4 Likes

If you care solely about your ability to use and modify, and you do not care about the legal status of training data, why do you care about the OSI’s label?

That is, will the absence of an approved OSI label make you any less likely to use this?

On the other hand, your desire to have a right to freely use and modify is almost certainly infringing on the rights of many other software developers whose code or documentation is in the training data.

For example, AGPL code I released is in the training data for multiple models. Your desire for a simple right to use does not exist in a vacuum…

Because I am a new user, the forum will only let me include two links; I have saved a version of this post with all the links if anyone wants it.
As part of coming up to speed on this discussion, I did a lot of archeology through my email archives into previous discussions about Preferred Form of Modification. I am sharing some of that here. I hope that will help people coming from the AI space understand how complex the issue of preferred form of modification already is in the software space. For those of us with an open-source/free-software background, perhaps some of this will still be useful.

Not OSI’s Traditional Area

I’ve been in Debian for around 25 years, and so I’ve been familiar with the work of OSI, but have not been directly involved in OSI until now. My impression is that OSI has traditionally focused on licenses and whether they were open-source. If OSI has focused at all on whether a particular software system was open, it didn’t appear to be a major activity of OSI from the outside. (I understand why the focus is different in looking at AI systems.)
Some licenses talked about the preferred form of modification, many did not.
But for organizations like Debian or Fedora that need to figure out whether software is “open enough”, preferred form of modification becomes a critical question. Whether the software includes the source code effectively boils down to whether it includes the preferred form of modification.

Long Debate

The debate goes back at least as far as 2003
We’re still debating it as of this Sunday.
In 2016 I talked about how I did not think there was a consensus within the Debian community about what the preferred form of modification was, and how we made progress anyway by delegating to a small team.

A Preferred Form

In recent years, the idea that there is not always a single preferred form of modification for a work has gained traction; see here for what I think is one of the earliest discussions that led to that.
Thinking about multiple preferred forms of modification comes up for examples like sound files, where one person may want to edit a Csound score and another may want to edit the rendered audio in a sound editor. Similarly, one person may want to work with a GIMP project while another may wish to edit in an editor that works with JPEGs or PNGs.

Preferred Form Changing Over Time

There are cases where Debian has accepted binary machine code as the preferred form of modification where previous source code had been lost.
There are cases where we’ve concluded that a binary blob of firmware is a preferred form of modification for firmware for a particular chip (and many more where we eventually concluded they were not).
Arguing about how to handle the firmware issue was probably the most divisive issue Debian faced from 2004 through 2008.
A lot of that related to thinking about what counted as source code and what preferred forms of modification were acceptable.

Does Preferred Form Depend on the Use

One common question is whether the preferred form of modification changes when one project is embedded in another. In a particular instance, the Handlebars JavaScript library was embedded in a Ruby gem to make it easier to use in Ruby applications. The maintainers of that gem planned to modify Handlebars by taking new versions from the upstream. They argued that for their use case the preferred form of modification was to take the minified release artifact from Handlebars (in effect a compiled artifact) and include it in their gem.
That was very convenient if you wanted to upgrade to the next version of handlebars, but less convenient if you wanted to make an arbitrary change to Handlebars.
Yet insisting that the full Handlebars sources and build system be available would have significantly complicated that gem.
Debian made one decision; other communities have made different decisions facing the same issue.

Conclusions for this Discussion

Traditionally deciding what counts as Preferred Form of modification has been left to small communities rather than being decided for the entire Open Source ecosystem.
Different communities have made different decisions.
The definitions and thinking about what counts as preferred form of modification have evolved over time, and have often been made by small groups because greater consensus was not possible.
At least in Debian, I think there’s general agreement that how free something is does not depend on the quality of the software. If some artifact is lost or was never maintained, it clearly is no longer the preferred form of modification.
We’ve agreed that Autoconf output (a nasty shell script) is a preferred form of modification in cases where someone found it more convenient to patch that shell script on an ongoing basis rather than changing the autoconf input and rerunning autoconf.
One of the questions that Debian asks is whether the original author has some advantage because of the preferred form of modification. If they have things that the rest of the world does not, that’s a sign that perhaps we don’t have all the source code.
But if the best we have for a particular work is just not as good as we wish it were, that does not decrease freedom.
The AI analogy would be: what if someone does a bad job of maintaining information about how they trained a model? Imagine that parameters like learning rate and optimizer selection are command line arguments, and no one ever wrote down which settings were used.
Should that mean that the AI system is not open? Or does it just mean that it is open but not very methodical?
Is that a question we even want to answer for the entire community?
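
To illustrate that hypothetical (this is an invented script, not any real project's code), the settings that matter may only ever exist as command-line arguments:

```python
# Hypothetical train.py for the scenario above: the hyperparameters exist only
# as command-line arguments, so unless someone records the exact invocation,
# the settings used for a released model may simply never be written down.
import argparse

parser = argparse.ArgumentParser(description="toy training entry point")
parser.add_argument("--learning-rate", type=float, default=3e-4)
parser.add_argument("--optimizer", choices=["adamw", "sgd", "adafactor"], default="adamw")
parser.add_argument("--epochs", type=int, default=1)
args = parser.parse_args()

# ... training loop driven by args; nothing here saves the chosen values
# alongside the published weights unless the author deliberately logs them ...
```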

@hartmans feel free to include all your links, you should be able to now. Thanks!

@hartmans, I remember the discussions and have been a Debian user since first switching to nix from FreeBSD. I agree that the OSI label could affect Debian, but would note that the vast majority of “normal” Linux users today, even those using Debian-based distros, aren’t bound by Debian decisions and almost always modify sources.list within minutes of install…

Similarly though, the Debian community also put the GFDL to a vote in 2006 over its modification restrictions:
https://people.debian.org/~srivasta/Position_Statement.xhtml
https://www.debian.org/vote/2006/vote_001

If I were active in the Debian community today, I would ask whether any LLMs, especially “aligned” chat models, could ever really meet the Debian Social Contract:
https://www.debian.org/social_contract#guidelines

Not only do aligned chat models restrict your use in instruction-following, but such models obviously embed the biases and discrimination of pretrain and alignment data creators.

In practice, Debian users will probably never install a model from APT anyway, since models are typically distributed as layer partitions from HuggingFace or another source. I can’t think of many 100GB+ tars on the mirrors…

I would like to be heard a bit differently. My goal was not to talk about how this affects Debian, but rather, to explore what it meant that OSI is getting into thinking about the realm of preferred form of modification. Here are the aspects of the long history of thinking about preferred form of modification I would like to bring into this discussion:

  • OSI is jumping into the deep end of thinking about preferred form of modification: what is the preferred form of modification for things that are not traditional executable code and for which multiple modification approaches are commonly used. The broader community has completely failed to come to consensus on this in a quarter century. It is unsurprising we are finding this hard today.

  • Was there a long discussion about the often-repeated assertion that training data is the preferred form of modification for a machine learning model, or was that something people coming into the discussion just assumed? I know at least in Debian’s discussions of ML policy, that was more or less an unquestioned initial assumption. If there was a long discussion of that topic, can I have a pointer so I can be more informed?

  • I absolutely agree that training data is a preferred form of modification for machine learning models. But are there others? (The obvious candidate being weights)

  • If there are other preferred forms of modification, how do we guarantee the freedoms we are working toward (using, modifying, understanding, and sharing)? I think the obvious question is likely: if there are other preferred forms of modification for models, such as weights, are there mechanisms to get adequate freedom to understand those models? Is data information the right answer?

  • In the broader community we have flexibility in allowing the definition of preferred form to evolve and to be decided in small groups. This flexibility has been valuable, and I am concerned that codifying preferred form of modification in a definition will reduce flexibility we need. Are there ways of getting this flexibility here? As an example, efforts like the NIST AI-601 in the US have a significant focus around data and data provenance as a factor to be evaluated in considering the risk profile of AI systems. I am sure there are other efforts in other areas of the world. So the landscape around how much data information people need to share to be successful is changing rapidly.

  • If I’m right that we’re very unlikely to come to a real consensus on the question of what is the preferred form of modification, what do we do? Is there another way to frame the problem that we can get consensus on? How comfortable are we with OSI somehow choosing as an organization and moving forward?

My hope was to learn from Debian and the broader community’s discussions about preferred form, not to focus on how this affects Debian.

2 Likes

What requirements does data information place on derived works? This question was briefly brought up in a previous topic, but I am failing to find the post and apparently didn’t bookmark it.
Specifically, imagine that I fine tune Mistral 7b in order to develop a system to be good at helping guide discussions in the open source community.
I release the training data where I can, and pointers to the data where I am concerned about legal implications.
I release the code I used for training, and the code for my resulting software system, say under the Apache 2.0 license.
I argue that I have given a preferred form of modification and met the conditions of draft 0.0.8’s text. (I have not looked at the checklist).
You may not be able to fully understand my base model, but if there is a model you like better, you can substitute that model and apply my training data to that model.
For a system like this, substituting the base model is a more natural approach to modification and understanding than digging into the base model.
I’ve certainly started AI projects and tried applying fine tuning data to multiple base models, and so that sort of substitution definitely works.
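
As a sketch of that substitution (model identifiers are placeholders, and fine_tune is a hypothetical helper standing in for my released training code), only the base model reference changes; the released data and code are reused as-is:

```python
# Sketch of the substitution argument above: the released fine-tuning data and
# training code are reused unchanged; only the base model is swapped out.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "mistralai/Mistral-7B-v0.1"   # or swap in "another-org/another-7b"

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
# fine_tune(model, tokenizer, released_training_data)   # hypothetical helper for my released code
```
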
I think that if I used Llama 2 and flowed down its use restrictions as required by its license, the result would not be open because it would not guarantee the freedom to use.
At least in the case where the derived work is released by someone other than Mistral–someone who is not in a position to cure the ways in which Mistral is not open–I argue that this derived work should be open.
Let’s take a look at the ways in which Mistral is not open:

  • Some of its code is not available under an open license. I think that’s a big deal in looking at whether a sufficiently skilled person could reproduce something substantially similar to Mistral. However, note that there is plenty of code under open licenses to train and infer with the Mistral weights in multiple forms. The issue with the specific code not being available is not that you can’t train a Mistral model, simply that your results might differ enough that you could not produce something substantially similar.

  • Data information. I think the big question is whether there is a data information break between a derived work and a base model. I argue that for a derived work, the answer should be that providing a pointer to a base model and enough specificity to retrain the derived work should be enough to meet the data information requirement.

I have been a Debian user since 1998, and for the past 25 years, I have always had several Debian developers around me. I am pleased to see Debian affiliates here as well.

The claim that training data for machine learning models is the preferred form of modification has recently been asserted by several industry figures, although it may have existed before. However, I personally feel that these claims do not adequately explain why training data is essential. While it is undoubtedly a preferred form, I am unsure if it is the most preferred.

Additionally, OSI is not an organization that determines the recommended form of modification. OSI has learned from the experience accumulated within the open-source community, including Debian. Therefore, OSAID will learn from the experience of the machine learning community, and at this point, it seems we have no choice but to have a flexible and ambiguous definition.

7 Likes

This is an important point. Even in Japan, which is considered one of the most lenient countries regarding machine learning under copyright law, training data is allowed to be used solely for the purpose of machine learning. Extracting and enjoying some parts of the data as content is not permitted. Therefore, distributing training data that people can use as regular copyrighted material could be illegal.

Additionally, I agree with your points on the issues of public domain and copyright protection periods.

3 Likes

@shujisado, if you were presented with evidence that it is trivial to extract training data from weights, would that change your perspective?

That is, for many models, distributing weights is functionally equivalent to distributing the training data.

Can I get a citation on that? One of the papers I read earlier this year suggested that LLMs memorized, I think, 1-2% of their training data (certainly less than 10%) and discussed how much data could be extracted. The conclusions appeared to be that there definitely was a significant fraction of training data that could not be extracted.
I just spent 30 minutes looking for the link and failed. But this is definitely one of those sub-discussions where links to papers are what we need.

But especially if it were true that training data could be extracted, I think that would be a stronger argument in favor of weights being a preferred form of modification.
If some people find that form useful for their tasks, and if it is not lossy, that argues in my mind that it is a preferred form of modification.
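
To make the extraction question concrete, a crude verbatim-memorization check might look roughly like this (the model identifier and the text are placeholders; the papers in this area use far more careful methodology):

```python
# Crude memorization probe (sketch only): prompt with a prefix from a document
# believed to be in the training set and check whether the model reproduces the
# true continuation verbatim under greedy decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-base-model"      # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prefix = "The first few hundred characters of a document thought to be in the training set..."
true_continuation = "...the next stretch of that same document."

inputs = tok(prefix, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy decoding

generated = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print("verbatim match:", generated.strip().startswith(true_continuation.strip()[:50]))
```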

1 Like
  1. From a normative or legal perspective, why does it matter what the % is? Can I violate the AGPL if it’s just a small % of Ghostscript? Knowingly distributing or providing access to a model that generates content infringing on 10% of millions of documents is still a pretty material risk surface.

  2. Technically speaking, the answer depends substantially on things like preprocessing/deduplication, tokenizer, relative frequency and length of sequence prefix, ratio of parameters to pretrain tokens, number of epochs, etc. The empirical consensus is that larger models are increasingly likely to memorize and output content that occurs more frequently.

  3. For cites, you can check some of the references I included in my post earlier this week:
    GPL2 kernel source as an MIT-licensed model: Is this really open?

These are only the oldest and most “reputable” sources, e.g., from Google, Eleuther, and ex-Meta/Patronus.

You can also search scholar.google.com or arxiv.org for (“memorization” | “copyright” | “eidetic”) + (LLM | “large language”) and find many different results.

  4. Related to modification, there is a growing body of research on modifying aligned models to “unlock” various behaviors.
    In particular, representation/control/steering techniques make it very easy to de-align or incentivize memorization. You can try yourself:

A concern I have about “data information” is how is it going to be assessed? Is this something we’ll be able to evaluate short of actually trying to recreate the AI system in question?

3 Likes

4 posts were split to a new topic: An open call to test OpenVLA

FYI, even when setting software RNG seeds, disabling any probabilistic hardware featuresets, and fixing all software/driver versions, it is practically impossible to produce the same trained model unless you have access to the exact same hardware environment in a very simple setup (e.g., single-node, single-GPU-sized models).

See, e.g.:

Provable “sameness” is probably something to altogether avoid…
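
For reference, the usual determinism knobs look something like the following (PyTorch shown as one example); even with all of them set, bitwise-identical trained weights across different hardware, drivers, or parallelism layouts are generally out of reach:

```python
# Typical "make it reproducible" settings (PyTorch, as one example). Even with
# all of these, identical trained weights generally require the exact same
# hardware, driver stack, library versions, and parallelism layout.
import os, random
import numpy as np
import torch

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required by some deterministic CUDA kernels
torch.use_deterministic_algorithms(True)           # raise an error on nondeterministic ops
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```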

In a typical AI model, weights are the result of computations. I am discussing under the assumption that the same intellectual property rights that apply to the training data do not apply to the weights. To change this assumption, we would need to await discussions in various countries.

The command line below also creates a “weight” file that is the result of computation.
$ gzip linux/fs/ext4/acl.c

Do you think GPL-2.0 does not apply to the acl.c.gz version?