Explaining the concept of Data information

Senficon · June 14, 2024, 3:13pm

I have really enjoyed following the discussion so far, thank you @stefano for summarizing the case for data information.

I would like to make a few points about the implications of copyright law for the application of open source principles to the subject of AI, especially for the question of training data access. They largely lead me to the conclusion that data information is a viable concept for the purposes of the OSAID.

The definition of open source software has a legal element and an access element: the access element is the availability of the source code, and the legal element is a license rooted in the copyright protection given to software. The underlying assumption is that the entity making software available as open source is the rights holder in the software and is therefore entitled to make the source code available without infringing the copyright of a third party, and to license it for re-use. To the extent that third-party copyright-protected material is incorporated into the open source software, it must itself be released under a compatible open source license that also allows redistribution.

When it comes to AI, the situation is fundamentally different: The assumption that an open source AI model will only be trained on copyright-protected material that the developer is entitled to redistribute does not hold. Different copyright regimes around the world, including the EU, Japan, and Singapore, have statutory exceptions that explicitly allow text & data mining for the purposes of AI training (I will leave aside the discussion of fair use in the US and several other jurisdictions here, as it is a controversial subject in many online discussions and it isn’t central to my argument). The EU text & data mining exceptions, which I know best, were introduced with the objective of facilitating the development of AI and other automated analytical techniques. However, they only allow the reproduction of copyright-protected works (aka copying), but not the making available of those works (aka posting them on the Internet).

That means that an open source AI definition requiring the republication of the complete dataset in order for an AI model to qualify as open source would categorically exclude open source AI models from relying on the text & data mining exceptions in copyright law. That is despite the fact that the legislator explicitly decided that, under certain circumstances (for example, allowing rights holders to declare a machine-readable opt-out from training outside of the context of scientific research), the use of copyright-protected material for the purposes of training AI models should be legal. This result would be particularly counterproductive because it would render open source AI models illegal even in situations where the dataset would be completely reproducible by the standards discussed here on the forum.

To illustrate: Imagine an AI model that was trained on publicly accessible, version-controlled text on the Internet, for which the rights holder had not declared an opt-out, but which the rights holder had also not put under a permissive license (all rights reserved). Using this text as training data for an AI model would be legal under copyright law, but re-publishing the training dataset would be illegal. Publishing information about the training dataset that included the version of the data that was used, when, how, and from which website it was retrieved, and how it was tokenized would meet the requirements of the OSAID v 0.0.8 if (and only if) it put a skilled person in the position to build their own dataset to recreate an equivalent system. Neither the developer of the original open source AI model nor the skilled person recreating it would violate copyright law in the process, unlike in a scenario that required publication of the dataset. Including a requirement in the OSAID to publish the data, in which the AI developer typically does not hold the copyright, would have little added benefit but would drastically reduce the material that could be used for training, despite the existence of explicit legal permissions to use that content for AI training. I don’t think that would be wise.
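To make the illustration above more concrete, here is a minimal sketch of what one entry of such dataset information might look like. It is only an illustration: the class name, field names, and example values are hypothetical and are not taken from the OSAID or any existing schema. The content hash is my own addition, on the assumption that a digest helps a skilled person check their rebuilt copy without the text itself being published.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class DataInformationRecord:
    """Hypothetical provenance entry for one training source (not an OSAID schema)."""
    source_url: str        # which website the text came from
    version: str           # version identifier of the data that was used
    retrieved_at: str      # when it was retrieved (ISO 8601, UTC)
    retrieval_method: str  # how it was retrieved
    tokenizer: str         # how it was tokenized
    sha256: str            # assumption: digest of the retrieved bytes, for later checking


def describe_source(content: bytes, **fields: str) -> DataInformationRecord:
    """Build a record that describes the content without containing it."""
    return DataInformationRecord(sha256=hashlib.sha256(content).hexdigest(), **fields)


def matches(record: DataInformationRecord, rebuilt: bytes) -> bool:
    """Check whether independently retrieved bytes equal what the record describes."""
    return hashlib.sha256(rebuilt).hexdigest() == record.sha256


# Placeholder values for illustration only.
text = b"Example text, all rights reserved."
record = describe_source(
    text,
    source_url="https://example.org/articles/1",
    version="rev-42",
    retrieved_at="2024-06-01T12:00:00Z",
    retrieval_method="HTTP GET, opt-out signals checked",
    tokenizer="example-bpe-v1",
)

# The published manifest holds only the description, never the protected text.
manifest = json.dumps(asdict(record), indent=2)
print(manifest)
```

Someone recreating the dataset would fetch the same version from the same source themselves and could call `matches(record, their_copy)` to confirm they obtained the same bytes. Only the manifest is made available, and each party makes their own reproduction.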

While I support the creation of public domain datasets that can be republished without restrictions, I would like to caution against pointing to these efforts as a solution to the problem of copyright in training datasets. Public domain status is not harmonized internationally: what is in the public domain in one jurisdiction is routinely protected by copyright in other parts of the world. For example, in US discourse it is often assumed that works generated by US government employees are in the public domain. They are not; they are in the public domain only in the US, while they are copyright-protected in other jurisdictions. The same goes for works in which copyright has expired: Although the Berne Convention allows signatory countries to end protection of a work once protection in the work’s country of origin has expired, exceptions to this rule are permitted. For example, although the first incarnation of Mickey Mouse has recently entered the public domain in the US, it is still protected by copyright in Germany due to an obscure bilateral copyright treaty between the US and Germany from 1892. Copyright protection is not conditional on registration of a work, and no even remotely comprehensive, reliable source of information on the copyright status of works exists. Good luck to an open source AI developer who tries to stay on top of all of these legal pitfalls.

Bottom line: There are solid legal permissions for using copyright-protected works for AI training (reproductions). There are no equivalent legal permissions for incorporating copyright-protected works into publishable datasets (making available). What an open source AI developer thinks is in the public domain and therefore publishable in an open dataset regularly turns out to be copyright-protected after all, at least in some jurisdictions. Unlike reproductions, which only need to follow the copyright law of the country in which the reproduction takes place, making content available online needs to be legal in all jurisdictions from which the content can be accessed. If the OSAID required the publication of the dataset, this would routinely lead to situations where open source AI models could not be made accessible across national borders, thus impeding their collaborative improvement, one of the great strengths of open source. I doubt that with such a restrictive definition, open source AI would gain any practical significance. Tragically, the text & data mining exceptions that were designed to facilitate research collaboration and innovation across borders would only support proprietary AI models, while excluding open source AI. The concept of data information will help us avoid that pitfall while staying true to open source principles.