The data requirement: "Sufficiently detailed information" for what?

It’s not that I thought full data was a requirement; rather, the wording was vague enough to go either way.

For the required part:

  • “Sufficiently detailed information” on its own means very little … I would use words about the utility of the information. To be exact, it should be “sufficiently detailed to allow someone to replicate the entire dataset”. Openness means reproducibility.

In the checklist:

  • There’s no mention of the data, only the code and the model. I would add the above definition to this checklist for absolute clarity.

For the optional part:

  • “Data sets” may be precise, but it may not be clear enough to ordinary people. I would add “as downloadable files” in there.



This is such a great question, it deserves its own thread.

I’ve learned that the word “reproducibility” is loaded and may not be the right one to use. Judging by the comment you left on HackMD, you’re thinking of requiring enough information to build an equivalent dataset. I think that’s one of the objectives of the data transparency requirement.

I’d like to hear more experts chime in. Draft 0.0.6 says:

  • Data: Sufficiently detailed information on how the system was trained, including the training methodologies and techniques, the training data sets used, information about the provenance of those data sets, their scope and characteristics; how the data was obtained and selected, the labeling procedures and data cleaning methodologies

This is really just a transparency requirement, using wording from the EU AI Act, which Open Future has already criticized for not going far enough.

What if the requirement in the Open Source AI Definition was more like:

  • Data: Sufficiently detailed information to allow someone to replicate the entire dataset.

How would this statement work with all the different ways of training machine learning systems, including federated learning, assisted learning, and other techniques where the dataset cannot be replicated?

Minor questions:

This is a known issue :slight_smile: I mentioned it here.

The definition is not aimed at ordinary people. The exact definitions of the components are given in the Model Openness Framework paper.


I’m very new to this discussion, so forgive me if this has already been mentioned, but I think the indicators in this paper by Bommasani et al. ([2310.12941] The Foundation Model Transparency Index, pp. 78–80) give quite a good operationalization of data transparency: data size, sources, creators, source selection, curation, augmentation, harmful data filtration, copyrighted data, license, and personal information (not implying that this list is exhaustive or that every indicator applies here). They’ve applied these to major foundation models as part of their Foundation Model Transparency Index.
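As an illustration only (not something the paper or the draft definition specifies), the ten data indicators above could be captured as a simple machine-readable checklist. The field names and the scoring function below are my own paraphrase, not the index’s actual schema:

```python
# Hypothetical sketch: the ten data-transparency indicators listed above,
# expressed as a checklist. Names are paraphrased, not the paper's schema.
DATA_INDICATORS = [
    "data size",
    "data sources",
    "data creators",
    "data source selection",
    "data curation",
    "data augmentation",
    "harmful data filtration",
    "copyrighted data",
    "data license",
    "personal information in data",
]

def data_transparency_score(disclosed: set) -> float:
    """Fraction of the ten indicators a model's documentation discloses."""
    return sum(ind in disclosed for ind in DATA_INDICATORS) / len(DATA_INDICATORS)

# Example: documentation disclosing only size, sources, and license.
print(data_transparency_score({"data size", "data sources", "data license"}))
```

A simple fraction like this is of course much cruder than the index’s actual grading, but it shows how such indicators turn a vague “sufficiently detailed” requirement into something checkable.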

Addendum, just to be clear: I fully agree that training data sets must be replicable.


Hello @florihas, to my recollection this specific work has not been mentioned before. I’ve looked at it, and I think it’s a great set of requirements. While I’m firmly in the “training data should be available” camp, short of that, these requirements would do a great deal to improve the state of the art on data transparency for “open weight” models.
