Open Source AI needs to require data to be viable

Thanks @Mark, I think we’re in agreement on the “open” debate. Let’s get back on topic:

Sorry if I sound like a broken record, but we need absolute precision, otherwise this becomes a debate about the gender of angels. What exactly do you mean by “requires data”? I don’t think that what the OP suggests is reflected in the criteria you used to assign green marks in your paper. Please correct me where I’m wrong.

Take Pythia-Chat-Base-7, a fine-tuned system you’ve reviewed, based on EleutherAI’s Pythia 7B (which we’ve reviewed).

Pythia-Chat-Base-7 reveals that its training dataset is OpenDataHub. That repository shows that the data is shared as links to Google Drives (if I read the code correctly). If one is interested in retraining Pythia-Chat-Base-7 (why one would want to do that is not for me to ask at this moment, but I reserve the right to ask later), one would have to go to OpenDataHub and get the datasets. But there is also the chance that the data is not there anymore, as links go stale.

If I understand @juliaferraioli’s argument (please chime in), this sort of disclosure doesn’t satisfy her requirements: links to Google Drives are not versioned, there is no easy way to fork the dataset, and there is no hash of the exact dataset used for the fine-tuning, as she wrote above.

And yet, in your paper the LLM Data box for Pythia-Chat-Base-7 is green.
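To make the “versioned, forkable, hashed” requirement concrete: a dataset pin of the sort Julia seems to be asking for can be as simple as a manifest of content hashes published alongside the model. This is a hypothetical sketch (none of the projects above actually publish such a manifest, and the file names are made up):

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 so large shards need not fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file's relative path to its content hash: a versioned 'pin'."""
    return {
        str(p.relative_to(data_dir)): sha256_file(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }

def verify(data_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the files whose content no longer matches the pinned hashes."""
    return [
        name for name, digest in manifest.items()
        if sha256_file(data_dir / name) != digest
    ]
```

With a manifest like this checked into the model repository, anyone re-downloading the data — from a Google Drive link, a torrent, anywhere — can confirm they hold the exact bytes used for the fine-tuning, which a bare link cannot guarantee.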

A similar issue arises for OLMo 7B Instruct; its model card says:

> For training data details, please see the Dolma, Tulu 2, and UltraFeedback documentation.

Which versions of Tulu 2 and UltraFeedback were used? That would fail Julia’s criteria, too… And that’s the top-scoring model.

Now, an even more complicated case: Pythia 7B, the ancestor of Pythia-Chat-Base-7, is built on the Pile dataset. Pythia-7B is one of the most well-documented models out there: code, data, papers, community… all open. The dataset it’s built on was public too, but the Pile was subject to a takedown request for alleged copyright violation and is not available anymore:

> The Pile is no longer hosted or distributed by The-Eye or affils.
> It is however still shared among the AI/ML community, presumably for its historical significance, and so can still be found via public torrents.
>
> Hashes of the original dataset can be obtained upon request.
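For what it’s worth, “hashes on request” only helps if people actually check them. A minimal sketch of such a check, with a hypothetical shard name and contents (the real Pile file list and digests would have to come from The-Eye):

```shell
# A copy obtained via torrent can be checked against published digests.
# The shard name and contents below are made-up stand-ins.
printf 'example shard contents' > 00.jsonl
sha256sum 00.jsonl > pile.sha256   # in reality, the publisher ships this list
sha256sum --check pile.sha256      # prints "00.jsonl: OK" if the copy is intact
```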

Now what do we do? If I can imagine an Open Source AI, Pythia-7B must be one. It probably even matched @juliaferraioli’s high bar for a few months. But what is Pythia-7B today?

And yet, all of these systems are at the top of the charts for transparency and reproducibility. If you look at the preliminary results of the validation phase, Pythia and OLMo would both be Open Source AI, while Falcon, Grok, Llama, and Mistral would not, because they don’t share data information.

Is the draft Open Source AI Definition working as intended? Can we please get more eyeballs to review systems?