Explaining the concept of Data information

We should be able to feel confident that “a skilled person can recreate a substantially equivalent system using the same or similar data” isn’t an empty promise.

Thank you. I found this explanation to be extremely valuable, and it is helping to inform my opinion on this hot topic in the context of the OSAID.

Julia Ferraioli-san has published a blog post regarding the OSAID, and I think it would be good to discuss it in this forum thread. Even for someone like me, who struggles with English and often has to rely on translation engines, her arguments were clear and easy to understand.

Among the points Julia-san raises, questioning the significance of defining open-source AI, criticizing the use of vague terms like “skilled person,” and debating the appropriateness of using the OECD definition are issues that have already been discussed in this forum. These are important issues to recognize but not unsolvable problems.

However, the issue of whether datasets should be mandatory or not might need more careful discussion. I was a bit disappointed by her words at the end of the article that seemed to give up on legal feasibility, but her argument in the middle of the article that the four freedoms cannot be exercised resonates with what many people feel.

At this point, I believe that a “complete dataset” is not always necessary, and maintaining the completeness of a dataset is extremely difficult. Therefore, I find the current list of mandatory components to be reasonable. However, I do not actively welcome the removal of the requirement for datasets. The general public will likely demand freedom for datasets as well, so we must gather enough material to convince them.

Therefore, I hope that more experiments like the following thread will be conducted.

Perhaps the team could start by taking the model that I freely open-sourced with complete data, which can be trained on a single consumer GPU, and seeing for themselves what they find when attempting to make it deterministic under ideal circumstances.

You might also review research like what I previously shared, or ask other experts who have trained models at scale (since, although I have, you don’t seem to value my opinion):

  • /arxiv.org/pdf/2202.02326 (link blocked by admin?)
  • /pytorch.org/docs/stable/notes/randomness.html (link blocked by admin?)
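
For reference, here is a minimal sketch of the kind of determinism settings those PyTorch notes describe, assuming a PyTorch/CUDA training setup (the seed value and helper name are illustrative, not from any particular codebase):

```python
import os
import random

import numpy as np
import torch

# cuBLAS needs a fixed workspace size for deterministic GEMMs (CUDA >= 10.2);
# set this before any CUDA work happens.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Seed every RNG the training loop touches.
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# Ask PyTorch to raise an error whenever a non-deterministic kernel is used.
torch.use_deterministic_algorithms(True)

# Disable cuDNN autotuning, which can pick different kernels from run to run.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

# DataLoader workers need their own seeding if data loading is randomized.
def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
```

Even with all of this in place, runs are generally reproducible only on a fixed hardware, driver, and library combination, which is part of the point being made here.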

For example, I’m sure the Eleuther team would happily share their experience on this topic. Just ask them.

While I find this to be a quite compelling argument for the wording in v0.0.8, I’m baffled. It then seems to me that a purely extractive business model, in which works are exploited but no compensation for the author is forthcoming, is permitted, but a model in which value is returned to society (although not necessarily to the author) is banned?

Or is there some provision in there in which the author of copyright-protected work is then entitled to royalties for a proprietary model?

Whether it’s an Open Source model or a closed model, there is the freedom to provide some form of compensation for the copyrighted works used in AI training.

I specifically mentioned compensation rather than “royalty” because “royalty” generally refers to a fee for the use of copyrighted material. In Japan, where I live, the use of copyrighted works for machine learning purposes is considered an exception under copyright law.

Certainly! Compensation is orthogonal to the openness of a model.

Yes, but I used “royalty” for exactly that reason. If the use of copyrighted material in a closed, non-published model forces the model producer to compensate the author of the copyrighted material, then I can understand why copyright law makes the distinction between reproduction and distribution into a model; but if not, then I fail to see how this would benefit the sciences and arts.

IAAL. From a legal perspective, both the percentage of infringing activity and the nature of the infringing activity matter.

Firstly, thanks to the ‘de minimis’ doctrine, small amounts of infringement do not count as ‘infringement’ in a number of jurisdictions.

Secondly, even if there is copyright infringement, there are defences to it. These defences (whether in the form of the “fair use” doctrine or in the form of specific “fair dealings”) concern themselves with both the amount of infringement and its purpose. If the amount is small and the purpose is, e.g., “transformative use”, or if it falls under one of the recognized exceptions, then it doesn’t count as infringement. In India, for instance, under Section 51 of the Copyright Act, 1957, you are even allowed to import an infringing copy “of any work for the private and domestic use of the importer”.

@pranesh:

  1. The de minimis doctrine makes sense in the context of single works; it doesn’t make sense at this scale and with this nature of output. Also, you’re ignoring the political economy and the legal reality of who is being infringed upon. For example, Llama 3.1 8B spits out the lyrics to most pop-culture songs or the first chapter of Harry Potter without even jailbreaking. Good luck arguing a de minimis defense there.

  2. When discussion of these models is paired with UBI for displaced jobs, when OpenAI’s own research published in Science estimates that practically half of jobs could be impacted, and when Mira Murati claims that some jobs should never have existed in the first place, I don’t think you have a fair use defense.

The models have taken training data produced by professionals with certain skills and are being used to displace other individuals with the same skills. This is, by definition, an intentional economic substitution and the very antithesis of fair use defenses.

Further, none of the other parts of the test are met. Here’s a prior argument I’ve shared:

(1) Purpose and Character of the Use:

The primary motive of the companies is clear commercial exploitation, demonstrated by billions of dollars of private company value and public company profits created, and the significant displacement of jobs as noted by Eloundou et al. or the obvious implications of “time-saving” marketing material for products like Office 365 Copilot.

(2) Nature of the Copyrighted Work:

Many of the works are highly creative and receive robust copyright protection. Mira Murati highlights the potential disruption in creative industries, questioning the value of human-created content. Additionally, Mustafa Suleyman’s assertion that web content is inherently “fair use” since the ‘90s does not align with the legal protections copyright law affords to creative works. It is clear that the defendants neither acknowledged nor differentiated between the works.

(3) Amount and Substantiality of the Portion Used:

The scale of use is excessively comprehensive; “downloading the entire Internet” diminishes the originality and value of individual works. Moreover, as Biderman et al. note, LLMs can memorize and potentially reproduce exact sequences, posing significant risks of incorporating and disclosing verbatim copyrighted content, thus further undermining creators’ rights. Recent research from Eleuther and Microsoft itself shows that even smaller models can recite more than 80% of their training data.

(4) Effect on the Potential Market or Value:

Eloundou et al. indicate that the use affects both high and low-income jobs across various industries, likely displacing significant market segments for original creators. The societal impact is severe, with Larry Summers and Mira Murati highlighting the potential loss across professional, technical, and creative forms of labor. Calls for universal basic income to support labor displacement, coming from the CEOs of both OpenAI and Anthropic, are a direct admission of the scale and magnitude of the effect on the market.

Conclusion: None of the four factors of §107 are met, and the severity of (4) alone would be enough to decide against the defendants.

The question I was responding to was: “From a normative or legal perspective, why does it matter what the % is”, and I answered that from the perspective of multiple copyright laws across the world: It does matter from a legal perspective.

Please note that US copyright law’s “fair use” right/defence isn’t universal (as @Senficon pointed out at the very beginning of this discussion). Furthermore, your analysis of “fair use” in the USA doesn’t take into account transformative use. The definition of “open source” specifically requires that there be no restriction on commercial usage, but many laws and policies give weight to non-commercial vs. commercial usage (as the “fair use” doctrine in the USA does). So, one cannot just jump to conclusions on what ought to be part of the “preferred form to make modifications” based on specific laws.

Given that laws in multiple jurisdictions incorporate exceptions for text and data mining, it is clear that not all share your assessment of what is or ought to be restricted by copyright law.

Lastly, I struggle to see how whether something “is almost certainly infringing on the rights of many other software developers whose code or documentation is in the training data” is relevant to whether something ought to count as “open source” or what should be included in the “preferred form of making modifications”. Neither the definition of “free software” nor the OSD deals with whether or not copyright-infringing software can count as “open source” or “free software”. That’s an orthogonal question.

There’s no question that there are many jurisdictions. It’s just that the most important ones for the OSI’s membership are clearly the US and the EU, and the EU and its member states do not have and have never had any “fair use” precedent, nor do they have commercial TDM exceptions, so the US is the most important (especially considering that many models are exported from under this governing law).

Also, I think you should return to my comments. Transformativeness is intrinsically economic, i.e., having to do with economic substitution/harm. Policy-makers and business leaders are literally and seriously discussing the end of many professions as we have known them.

Furthermore, the infringing “thing” is pretrained using an objective function that minimizes next-token prediction loss, i.e., copying-machine accuracy. That copy-machine-like behavior survives into the final product is then, of course, entirely unsurprising, and that behavior is obviously likely to infringe (the opposite of Betamax’s facts).
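
For concreteness, here is a minimal sketch of the standard next-token objective being referred to, assuming a PyTorch-style causal language model (the function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Standard causal LM objective: predict token t+1 from tokens up to t.

    logits:    (batch, seq_len, vocab_size) model outputs
    token_ids: (batch, seq_len) the tokenized training text itself
    """
    # Shift so that each position is scored on how well it reproduces the
    # *next* token of the training text.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = token_ids[:, 1:].reshape(-1)
    # Per position, cross-entropy is minimized when the model puts
    # probability 1 on the actual next token of the training text.
    return F.cross_entropy(pred, target)
```

The loss is lowest when the model reproduces the training text token by token, which is the “copying-machine accuracy” framing used here.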

If you can’t accept that this is probably the definition of non-transformative, then I don’t know what to say. You can argue that statutory changes should be made for policy reasons, but then we’re back to political economy and normative discussions, not legal ones.

We, the Asian countries, are accustomed to being overlooked in global discussions, but I would like to say one thing: in Japan, where I live, the use of copyrighted works without permission for AI training is generally treated as an exception under copyright law. Below is the relevant provision from Article 30-4 of Japan’s Copyright Act. Neighboring East Asian countries seem to be moving in the direction of stipulating similar exceptions to copyright, like Japan. Therefore, approaches such as the EU Directive on Copyright in the Digital Single Market are unlikely to become mainstream in Asia.

Article 30-4 It is permissible to exploit a work, in any way and to the extent considered necessary, in any of the following cases, or in any other case in which it is not a person’s purpose to personally enjoy or cause another person to enjoy the thoughts or sentiments expressed in that work; provided, however, that this does not apply if the action would unreasonably prejudice the interests of the copyright owner in light of the nature or purpose of the work or the circumstances of its exploitation:
(ii) if it is done for use in data analysis (meaning the extraction, comparison, classification, or other statistical analysis of the constituent language, sounds, images, or other elemental data from a large number of works or a large volume of other such data; the same applies in Article 47-5, paragraph (1), item (ii));

I appreciate that most analysis is biased towards a US/EU focus, but the reality is that 1) the most commonly used models are trained and distributed by US/EU-based organizations and 2) much of the foundational content (e.g., Common Crawl, the Pile) was sourced from Western rightsholders.

Furthermore, I am familiar with the recent statutory movement in Japan, but isn’t Japan a member of WIPO? Isn’t it therefore bound by WIPO treaties like DMCA/WCT?

Let’s state the reality. Japan, China, and South Korea are each developing language models tailored to their respective countries, and those models are actually being used. Even my company was developing LLMs with performance almost comparable to GPT-3.5 up until that point. Many Japanese companies, including mine, have shifted to using or improving upon Western foundational models because they realized it is more efficient for their business. Yes, Open Source AI is important.

Also, when Common Crawl was started, there were Japanese people among the key figures involved. The fact that Common Crawl is based in the United States does not carry much significance. While English is the most common language in the Common Crawl content, I believe Russian is the second most common. Japanese also makes up about 5%.

Yes, Japan is a member of WIPO, and there is even a WIPO office located in Tokyo. Article 30-4 of Japan’s Copyright Act, as mentioned earlier, was established after thorough examination to ensure full compliance with the three-step test of the Berne Convention and the WIPO Copyright Treaty.

I’d love to understand how the updated Articles 30 and 47 are either clear on this topic or compatible with WCT/DMCA.

If anything, my understanding is that it 1) expressly allows piracy at training time, in breach of WCT, 2) strengthens a typical TDM exception for private reproduction, but 3) creates a fair use-like standard whereby the economic impact of use is evaluated against both the model developer and subsequent users.

I get that everyone wants good models for free, and we all want to see truly open competitors to Meta or OpenAI, but that doesn’t mean we can ignore the laws and rules we’ve all agreed to; otherwise we’re just going to undermine the only system of cooperation we have left.

Similar misgivings have been voiced by the open science community: EU copyright law may allow making reproductions of copyright-protected content for research purposes, but not necessarily making the research corpus that was used available to third parties, thus negatively impacting the reproducibility of research results and open access principles. The rationale behind placing more stringent restrictions on making works available than on making reproductions is the belief that making entire works available to the general public will harm the rights holders’ ability to generate revenue from selling access to those works to a much greater extent than reproductions for a particular purpose will. That can sometimes lead to undesirable societal side-effects.

Thanks, @Senficon,

Yeah, that is the assumption I can see they are making. I’m afraid they will be mistaken, but regardless of my opinion, I would certainly think such a policy should be made only on very solid evidence from extensive research.