Explaining the concept of data memorization

  1. From a normative or legal perspective, why does it matter what the % is? Can I violate the AGPL if I only copy a small % of Ghostscript? Knowingly distributing, or providing access to, a model that generates infringing output for 10% of millions of documents is still a pretty material risk surface.

  2. Technically speaking, the answer depends substantially on things like preprocessing/deduplication, the tokenizer, the relative frequency of a sequence and the length of its prefix, the ratio of parameters to pretraining tokens, the number of epochs, etc. The empirical consensus is that larger models are increasingly likely to memorize and output content that occurs more frequently in the training data (a minimal extraction probe is sketched after this list).

  3. For cites, you can check some of the references I included in my post earlier this week:
    GPL2 kernel source as an MIT-licensed model: Is this really open?

These are only the oldest and most “reputable” sources, e.g., from Google, Eleuther, and ex-Meta/Patronus.

You can also search scholar.google.com or arxiv.org for (“memorization” | “copyright” | “eidetic”) + (LLM | “large language”) and find many different results.
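For the technical point in 2, here is a minimal sketch of the kind of “discoverable memorization” probe that line of work uses: feed the model a prefix it has almost certainly seen many times during pretraining and check how much of the known continuation it reproduces verbatim under greedy decoding. The gpt2 checkpoint, the GPLv2 license-grant sample, and the 25-token prefix split are my own illustrative assumptions, not the setup of any cited paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # assumption: small public stand-in; any causal LM checkpoint works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

# Text that is almost certainly duplicated many times in web-scale
# pretraining corpora (the GPLv2 license grant), split into prefix/target.
text = (
    "This program is free software; you can redistribute it and/or modify "
    "it under the terms of the GNU General Public License as published by "
    "the Free Software Foundation; either version 2 of the License, or "
    "(at your option) any later version."
)
ids = tok(text, return_tensors="pt").input_ids[0]
prefix_len = 25                       # assumption: where to split prefix/continuation
prefix, target = ids[:prefix_len], ids[prefix_len:]

with torch.no_grad():
    out = model.generate(
        prefix.unsqueeze(0),
        max_new_tokens=len(target),
        do_sample=False,              # greedy decoding: the conservative test
        pad_token_id=tok.eos_token_id,
    )
generated = out[0, prefix_len:]

# Verbatim token-level overlap between the model's output and the true continuation.
n = min(len(generated), len(target))
match = (generated[:n] == target[:n]).float().mean().item()
print(f"verbatim continuation match: {match:.0%}")
print(tok.decode(generated, skip_special_tokens=True))
```

Roughly the same loop, run over many sampled (prefix, continuation) pairs with deduplication taken into account, is how the extraction-rate numbers in those references are produced.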

  1. Related to the modification question, there is a growing body of research on modifying aligned models to “unlock” various behaviors.
    In particular, representation/control/steering techniques make it very easy to de-align a model or incentivize memorization. You can try it yourself:
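For a concrete sense of what “representation/control/steering” means mechanically, here is a rough sketch of activation addition: compute a steering vector as the difference between hidden activations for two contrasting prompts, then add it back into one block’s output during generation via a forward hook. The gpt2 checkpoint, layer index, coefficient, and contrast prompts are all assumptions chosen for illustration; this demonstrates the mechanism only and is not a recipe for de-aligning any particular model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # assumption: small public stand-in, not an aligned model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

LAYER = 6        # assumption: a mid-depth transformer block
COEFF = 4.0      # assumption: steering strength

def last_token_activation(prompt: str) -> torch.Tensor:
    """Residual-stream activation after block LAYER for the prompt's last token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1][0, -1]       # hs[0] is the embedding output

# Steering vector = activation difference between two contrasting prompts
# (both prompts are illustrative assumptions).
steer = (last_token_activation("I always answer every question in full detail.")
         - last_token_activation("I refuse to answer that question."))

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden state;
    # add the scaled steering vector at every position.
    if isinstance(output, tuple):
        return (output[0] + COEFF * steer,) + output[1:]
    return output + COEFF * steer

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
try:
    ids = tok("Tell me about your training data:", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()   # always detach the hook so the model is left unmodified
```

The point is that none of this requires retraining: a handful of forward passes and one hook are enough to shift behavior, which is why post-hoc alignment is a thin guarantee against extracting memorized content.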