On the current definition of Open Source AI and the state of the data commons

The site for the Minerva series of LLMs states that the models use datasets such as CulturaX, OSCAR-2201, OSCAR-2301, and mC4. These datasets are extracted from Common Crawl, and the OSCAR page specifically includes a disclaimer: “Being constructed from Common Crawl, personal and sensitive information might be present.” This means that if any problems arise over that sensitive information, the data may end up not being publicly released.

Additionally, while the site mentions that the training data for the Minerva LLMs is sampled from datasets like OSCAR, I could not find where the actual training data is located.
