Proposal to handle Data Openness in the Open Source AI definition [RFC]

TIL “Open” is now well-defined, covering 3 of the 4 freedoms (study being absent):

“Open data and content can be freely used, modified, and shared by anyone for any purpose”

And RC1 doesn’t meet even this looser standard. In other words, the Open Source AI Definition (OSAID) is not, in fact, Open according to its accepted definition.

Or at least, those claiming it does have not shown that it does, while those who dispute the claim have offered counterexamples.

Agreed. Either you fully protect the four essential freedoms, or you do not.

While study does start to venture into the Open Science realm, it was deemed important enough to be considered essential.

Agreed. Adding a definition that is both difficult to meet (in the spirit of the rules) and yet easy to circumvent (to the letter of the rules) does not help and likely hurts our cause.

Adding one more document into the discussion.

[1] ‘Model AI Governance Framework for Generative AI’, AI Verify Foundation. Accessed: Oct. 08, 2024. [Online]. Available: MGF for GenAI – AI Verify Foundation

[2] ‘Singapore proposes framework to foster trusted Generative AI development’, Infocomm Media Development Authority. Accessed: Oct. 08, 2024. [Online]. Available: Model AI Governance Framework 2024 - Press Release | IMDA

[1, p.11]

Facilitating Access to Quality Data

As an overall hygiene measure at an organisational level, it would be good discipline for AI developers to undertake data quality control measures and adopt general best practices in data governance, including annotating training datasets consistently and accurately, and using data analysis tools to facilitate data cleaning (e.g., debiasing and removing inappropriate content).

Globally, it is worth considering a concerted effort to expand the available pool of trusted datasets. Reference datasets are important tools in both AI model development (e.g., for fine-tuning) as well as benchmarking and evaluation. Governments can also consider working with their local communities to curate a repository of representative training datasets for their specific context (e.g., in low resource languages). This helps to improve the availability of quality datasets that reflect the cultural and social diversity of a country, which in turn supports the development of safer and more culturally representative models.
