Towards Best Practices for Open Datasets for LLM Training

nick · January 16, 2025, 10:43pm

Hi everyone,

I would like to share a recent paper that came out of an event organized by Mozilla and EleutherAI, which convened 30 scholars and practitioners to develop normative principles and technical best practices for creating openly licensed LLM training datasets.

https://arxiv.org/pdf/2501.08365

Let us know your thoughts on this paper!