Reimagining data for Open Source AI: A call to action

Originally published at: Reimagining data for Open Source AI: A call to action – Open Source Initiative

The Open Source Initiative (OSI) and Open Future have published a white paper: “Data Governance in Open Source AI: Enabling Responsible and Systematic Access.” This document is the culmination of a global co-design process, enriched by insights from a vibrant two-day workshop held in Paris in October 2024.

This paper is a major deliverable, on par in importance with the Open Source AI Definition v.1.0. It complements the other paper about data (Towards Best Practices for Open Datasets for LLM Training, by @stellaathena and Mozilla), which more specifically targets LLMs and foundation models (or, as they’re called in Europe, general-purpose AI). Both are IMO a must-read for anyone interested in this topic.

The Data Governance in Open Source AI paper by @Alek_Tarkowski provides a foundation for a deeper understanding of these complex issues.

It’s clear from these two papers that the concept of open data, raised multiple times on this forum, doesn’t cover the whole spectrum of possible options for training LLMs. The idea of data commons, with its broader set of stakeholders, models reality better.

I’d love to see comments on these papers from the data experts here.

Did anyone get a chance to read the white paper? We look forward to your comments. Thanks! :pray:

The recommendations in this white paper, such as enhancing data transparency, promoting data commons, and advancing open data, are fundamentally important across all jurisdictions. In Japan as well, many concerns regarding governance, ethics, and transparency are relevant, making this paper a welcome contribution.

However, I feel that it may be difficult to attract significant interest in Japan and China, while European countries are likely to be the primary target audience.

Japan already has a legal framework well-suited for AI development, where permission from rights holders is not required to use copyrighted works for AI training. Therefore, it seems unlikely that there will be strong interest in data commons or open data. Similarly, in China, all data falls under government control, making the idea of data openness an unlikely focus of attention. I was reluctant to write this as it includes some negative observations, but the areas of interest differ depending on the region.

Furthermore, I am more interested in the openness of synthetic data usage. The use of synthetic data is expanding rapidly, and U.S. AI vendors are likely to impose restrictions on the use of AI-generated outputs. Given this situation, I have yet to form a solid opinion on what the ideal approach should be. This white paper does not yet include any noteworthy recommendations specifically addressing synthetic data.

Hi @shujisado -san!

Thank you for your constructive comments.

You raise a valid point about the paper’s perspective, which is focused on understanding the challenges and opportunities for data governance in the “West”, particularly in the U.S. and Europe. Thank you for always expanding our view and bringing the perspective from Asia, in particular Japan and, in this case, China as well.

Taking a more careful look at synthetic data is also spot on, as it’s growing in importance.
