Explaining the concept of Data information

mjbommar · August 27, 2024, 3:04pm

The de minimis doctrine makes sense in the context of single works; it doesn’t make sense given the scale and nature of these models’ output. Also, you’re ignoring the political economy and legal reality of who is infringed upon. For example, Llama 3.1 8B spits out the lyrics to most pop culture songs or the first chapter of Harry Potter without even jailbreaking. Good luck arguing a de minimis defense there.
When discussion of these models is paired with calls for UBI for displaced workers, when OpenAI’s own research published in Science estimates that practically half of jobs could be impacted, and when Mira Murati claims that some jobs should never have existed in the first place, I don’t think you have a fair use defense.

The models have taken training data produced by professionals with certain skills and are being used to displace other individuals with the same skills. This is, by definition, an intentional economic substitution and the very antithesis of fair use defenses.

Further, none of the other parts of the test are met. Here’s a prior argument I’ve shared:

(1) Purpose and Character of the Use:

The primary motive of the companies is clear commercial exploitation, demonstrated by billions of dollars of private company value and public company profits created, and the significant displacement of jobs as noted by Eloundou et al. or the obvious implications of “time-saving” marketing material for products like Office 365 Copilot.

(2) Nature of the Copyrighted Work:

Many of the works are highly creative and receive robust copyright protection. Mira Murati highlights the potential disruption in creative industries, questioning the value of human-created content. Additionally, Mustafa Suleyman’s assertion that web content has been inherently “fair use” since the ’90s does not align with the legal protections copyright law affords to creative works. It is clear that the defendants neither acknowledged nor differentiated between the works.

(3) Amount and Substantiality of the Portion Used:

The scale of use is excessively comprehensive; “downloading the entire Internet” diminishes the originality and value of individual works. Moreover, as Biderman et al. note, LLMs can memorize and potentially reproduce exact sequences, posing significant risks of incorporating and disclosing verbatim copyrighted content, thus further undermining the creators’ rights. Recent research from Eleuther and Microsoft itself shows that even smaller models can recite more than 80% of their training data.

(4) Effect on the Potential Market or Value:

Eloundou et al. indicate that the use affects both high and low-income jobs across various industries, likely displacing significant market segments for original creators. The societal impact is severe, with Larry Summers and Mira Murati highlighting the potential loss across professional, technical, and creative forms of labor. Calls for universal basic income to support labor displacement, coming from the CEOs of both OpenAI and Anthropic, are a direct admission of the scale and magnitude of the effect on the market.

Conclusion: None of the four factors of §107 are met, and the severity of (4) alone would be enough to decide against the defendants.