The OSAID requires training data to be shared

Dear members,

We would like to sincerely thank everyone who provided feedback on the OSAID draft. Your insights have been incredibly helpful in refining the text.

Based on your valuable input, we have made updates to the draft to better align with our community’s needs and expectations. We encourage you to bring fresh ideas and respectful arguments to the conversation.

We’ll soon be launching Release Candidate 1 of the OSAID. We would like to share a comment we have added to the FAQ:

The Open Source AI Definition requires training data to be shared, and that has been the case since early in the community-driven process. If you have the rights to share the training data, you must do so. If you do not have the rights to share it, you are required to provide a great deal of specific detail about that data before you can claim your AI to be Open Source. It’s an approach that provides the maximum degree of transparency within the patchwork of global copyright laws while still giving users the ability to share, modify, and distribute Open Source AI systems.

We appreciate your continued participation and support in making this forum a respectful and constructive space.

Kind regards,
Nick

3 Likes

I’m surprised that you would make a major change like this for the RC1 version, but it feels like a step in the correct direction. I’m looking forward to seeing the updated draft.

The current FAQ still looks a bit vague to me. Under what kind of license should the training data be shared? Does open access (e.g., with an academic-use-only restriction) fulfill “to be shared”? Or does it require an OSI-approved license for the dataset itself (covering images, texts, etc.)?

1 Like

Sorry @nick, maybe I’m facing a language barrier here, but I’m not sure I understood the new FAQ.

Does the new OSAID version require training data availability or not?

I mean: at first glance, it seems that the new FAQ makes the loophole the community has been reporting for months explicit, instead of fixing it.

I would be more than happy to endorse an Open Source AI definition that won’t enable black-box builders to elude the legal and technical scrutiny required by the AI Act, but technically speaking, such a change would rule out the “Unshareable non-public training data” category that still appears in the FAQ.

Did you remove it from the Release Candidate? Or did you leave it and explain that only AI systems trained solely on Open or Public datasets can qualify as Open Source?

If so, many thanks for fixing the issue, and thanks to @samj for his analysis of the working group dynamics that showed how such an erroneous conclusion was drawn.

1 Like

Thank you @nick.

@lumin: I think open data is a bridge too far for now, and that what matters to protect the four freedoms is access, per my reply on the MOF’s Class I requirements which permit datasets under “any license or unlicensed”. This is the compromise I believe we need to make, even if temporarily (like we did for the binary blobs), and it will mean @stefano can satisfy the board’s approval criteria (slide 9) that the OSAID “provides real-life examples”, as well as being “ready by October 2024”, and it should be better “supported by diverse stakeholders” (slide 22): Developers, Deployers, End Users, and Subjects.

@Shamar: The bar is set by the lowest acceptable requirements, so while, like @lumin, I too am surprised and impressed to see this change in the “correct direction”, I agree that the loophole will still permit “Open Source AI” that fails to protect the four freedoms, while also making the definition unmanageable and unenforceable. We should be building on the MOF’s machine-readable checklist (section 7.3) so we can have a service like GitHub Actions verifying the claims therein (by HTTP status, hash, etc.), making it entirely self-service so the OSI only has to deal with disputes. This is not the same as approving one license for many projects; every project needs its own approval, so self-service and automation are essential.
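
To make that concrete, here’s a minimal sketch of the kind of check such a service could run. The manifest fields below (name, url, sha256) are placeholders of my own, not the actual MOF schema:

```python
# Minimal sketch of automated claim verification. The "artifacts" list with
# name/url/sha256 fields is an illustrative placeholder, not the real MOF
# section 7.3 format.
import hashlib
import json
import sys

import requests  # third-party: pip install requests


def verify_artifact(entry: dict) -> bool:
    """Download a listed artifact and confirm it matches its declared hash."""
    resp = requests.get(entry["url"], timeout=60)
    if resp.status_code != 200:
        print(f"{entry['name']}: unreachable (HTTP {resp.status_code})")
        return False
    if hashlib.sha256(resp.content).hexdigest() != entry["sha256"]:
        print(f"{entry['name']}: hash mismatch")
        return False
    print(f"{entry['name']}: OK")
    return True


if __name__ == "__main__":
    with open(sys.argv[1]) as f:  # e.g. a mof.json-like manifest
        manifest = json.load(f)
    results = [verify_artifact(e) for e in manifest.get("artifacts", [])]
    sys.exit(0 if all(results) else 1)
```

Run on every submission (e.g., in a CI workflow), a non-zero exit code would simply flag the claim for human review, so the OSI only ever sees the disputes.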

2 Likes

Sorry @samj, but I’m still confused.

The title of this thread says “The OSAID requires training data to be shared”, but the current FAQ says it allows unshareable data.

Which of the two declarations is false?
 

Given that @nick clarified that the negative votes from the Llama team were an error and “should be converted to 0”, and you have shown that this leads to a definition that requires both training and testing data, then, assuming good faith, the FAQ should be fixed before RC1, shouldn’t it?


PS: I agree that as long as the training and testing data are available to the public under the same terms that allowed the builders to use them in the first place, we can still count the AI system as “Open Source AI”, since the four freedoms are still granted even if the builders cannot directly distribute the data.

The proposal that @nick closed a few weeks ago went in that direction.
Maybe it could be reopened and discussed now?

2 Likes

I believe we have definitely made significant progress.
At this point, we are being asked to draw a line regarding what our community can accept, and it seems that line is becoming clearer.

@lumin-san’s concerns are valid, but perhaps that point can continue to be considered as we move forward.

I was thinking again about the layers below regarding what kind of licenses should be applied to datasets, but… well, let’s see the new version.

  • D1: Datasets composed only of Open Data, with an OSD-compliant license applied.
  • D2: Datasets with an OSD-compliant license applied, where the four freedoms are guaranteed for machine learning purposes under applicable law.
  • D3: Datasets that are accessible to third parties, whether or not free of charge, and where the four freedoms are guaranteed for machine learning purposes under applicable law.
  • D4: Datasets that are accessible to third parties, whether or not free of charge.

2 Likes

Thank you for the update @nick. The text you posted is for the FAQ, by most accounts an informative rather than a normative document. Will a similar change be made in the main text, where the most recent version had only the much weaker requirement for “data information”?

1 Like

Hi @mark, yes, the main text is receiving similar changes. The “data information” requirement is becoming more precise and more “rigorous” based on community feedback. Updates soon to follow.

Another pertinent point.

@stefano just asked “What does available mean[…]?” and I suggest it should mean accessible via URL over a common protocol (http, ftp, torrent, etc.), without authentication, click-through agreement, or other impediments, both for users and for the practicality of our own validation scripts being able to check (i.e., HTTP status/headers), download, and hash the dependencies listed in mof.json (section 7.3) or our own equivalent, enabling self-service and avoiding turning the OSI into a bottleneck.
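
As a rough sketch of what that availability test could look like (the specific criteria below, such as which protocols to accept and what counts as an access wall, are my own assumptions rather than anything decided):

```python
# Rough "is this dependency actually available?" predicate. The rejection
# criteria (auth walls, HTML landing pages) are assumptions for illustration,
# not agreed OSI policy.
from urllib.parse import urlparse

import requests  # third-party: pip install requests


def is_available(url: str) -> bool:
    """True if the URL answers over plain HTTP(S) without authentication."""
    if urlparse(url).scheme not in ("http", "https"):
        return False  # ftp/torrent links would need their own checkers
    resp = requests.head(url, allow_redirects=True, timeout=30)
    if resp.status_code in (401, 403):
        return False  # login or access wall
    if resp.status_code != 200:
        return False
    # An HTML response for a dataset URL usually means a landing page or
    # click-through agreement rather than the data itself.
    return not resp.headers.get("Content-Type", "").startswith("text/html")
```

A paywalled or login-gated dataset would fail the 401/403 check, and a click-through landing page would typically fail the content-type check, which is roughly the “no impediments” bar described above.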

Water seeks its own level, and legalese is similar to scripting: when you interpret the current checklist, the bar is effectively set at “data card” for 0.0.9:

At least one of these data components is required, in decreasing order of importance:

  1. Datasets
  2. Research paper
  3. Technical report
  4. Data card

I don’t think there’s a person here who wouldn’t agree this is inadequate for protecting the four freedoms.

2 Likes

I’m glad to see such a comment, as few people have pointed out deficiencies in the checklist. Yes, as it stands with version 0.0.9’s checklist, simply publishing a data card is enough to meet the conditions.

A month ago, I left a note in HackMD about the lack of explanation within the data section of the checklist, and I recognize that it will inevitably need to be revised to reflect the discussions we’ve had so far.

2 Likes