Initial Report on Definition Validation

That’s the idea: the Preferred form to make modifications lists the basic principles that are unlikely to change in the future, while the Checklist below lists the components required to comply with that definition of preferred form.

The validation phase is designed to test this hypothesis:

That the availability of these components:

  • Training methodologies and techniques
  • Training data scope and characteristics
  • Training data provenance (including how data was obtained and selected)
  • Training data labeling procedures, if used
  • Training data cleaning methodology
  • and the required Code components

provides “sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.”
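
To make the checklist easier to apply consistently across systems, here is a minimal sketch of how a reviewer might record whether each component could actually be found. This is purely illustrative: the field names, component labels, and class names are my own assumptions, not part of the draft definition or the working group’s process.

```python
# Hypothetical sketch of a per-system review record mirroring the checklist above.
# Names and fields are assumptions for illustration only.
from dataclasses import dataclass, field

DATA_COMPONENTS = [
    "training methodologies and techniques",
    "training data scope and characteristics",
    "training data provenance",
    "training data labeling procedures",
    "training data cleaning methodology",
    "required code components",
]

@dataclass
class ComponentFinding:
    component: str
    available: bool          # could the reviewer actually locate it?
    source_url: str = ""     # where it was found, if anywhere
    notes: str = ""          # e.g. "elusive document", "partial only"

@dataclass
class SystemReview:
    system_name: str
    findings: list[ComponentFinding] = field(default_factory=list)

    def missing(self) -> list[str]:
        """Components the reviewers could not find for this system."""
        return [f.component for f in self.findings if not f.available]
```

Something this simple would at least let us compare systems side by side and see where the “elusive documents” problem concentrates.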

Mer’s report highlights the difficulties encountered by the volunteer reviewers; the bullet point “Elusive documents” is especially telling.

If the volunteer experts in the working group can’t find the components, then we can’t properly evaluate the systems or test the hypothesis. We need a better way.

Maybe we need to ask the developers of the AI systems to fill in a survey, so they’d provide all the details themselves. This may work, and I can see people leading Open Source-friendly projects like @Stella_Biderman, @Danish_Contractor and @vamiller filling in such a form. But Meta, Mistral, xAI and the like? They won’t, and they will likely continue to call their systems “open source” because they want to escape the obligations of the EU AI Act or gain some other market advantage. Maybe that won’t be a big issue, because eventually there will be enough public pressure to censure those who abuse the term Open Source AI, just as there is public pressure now on those who abuse the term Open Source for software.

What do you think?
