Originally published at: https://opensource.org/blog/why-datasets-built-on-public-domain-might-not-be-enough-for-ai
Common Corpus is a public-domain dataset for training large language models (LLMs). Boasting 500 billion words in multiple languages, drawn from various cultural initiatives, it offers researchers a powerful tool for developing smaller and more efficient LLMs. It should not, however, be abused as a tool to promote public policies that expand the reach of copyright law.