Originally published at: Data Transparency in Open Source AI: Protecting Sensitive Datasets – Open Source Initiative
The Open Source Initiative (OSI) is running a series of stories about a few of the people involved in the Open Source AI Definition (OSAID) co-design process. Today, we are featuring Tarunima Prabhakar, one of the volunteers who has helped, and continues to help, shape the OSAID.
A very interesting read.
We are trying to find a happy medium that lets us balance the numerous concerns- recognition of effort and effectiveness of the data on one hand, and transparency, adaptability and extensibility on the other.
[…]
For our project in particular, we are considering the option of staggered data release- older data is released under an open data license, while the newest data requires users to request access.
This reminds me of the ongoing work at Sentry around Fair Source Software: maybe Tattle simply needs a Fair Source AI definition that turns their system into Open Source AI when the corresponding datasets are released under an open data license?
Or has the OSI recently approved licenses like the FSL, FCL and BSL?
Do they match the OSD?
If not, I can’t see why an AI system that delays the freedom to study the system should qualify as Open Source AI.
I fail to see any justification for labeling something “Open Source” when it obviously does not respect the basic principles the community holds.
And I also fail to see any drama in an AI not being “Open Source AI” because some of its datasets or model networks are not compliant.
If it’s not “Open Source AI”, it’s not “Open Source AI”. Period.
Over the past years I’ve seen countless newer licenses get rejected. They never made it to the OSI list.
No one really cared: the proponents carried on with their business, they just wouldn’t be allowed to market it as “Open Source”, and their customers didn’t care.
The same happens here: if an AI system can’t have all of its components respecting the values of the Open Source community, it’s not “Open Source AI”. Period.
Now change your dates, redo the presentations, and start correcting things in the 0.0.10 version.
With the benefit of hindsight (I rejoined the fray two weeks ago, the day before this was posted), it’s clear @nick posted this as justification for excluding training datasets from the 0.0.9 definition. And on first pass it sounds like a reasonable argument for the intended use case (filtering abuse) — hide the filters to prevent their circumvention, right?
Aside from this being security by obscurity (and thus no security at all; a filter that reliably detects the meaning of a message regardless of the terminology used would be better, and is feasible with today’s technology), one could counter that widespread availability of abuse filters is more beneficial to society on balance.
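To make the “feasible with today’s technology” aside concrete, here is a minimal sketch of a meaning-based filter built on off-the-shelf sentence embeddings. The model name, seed phrases and similarity threshold are illustrative assumptions on my part, not anything taken from Tattle’s actual system.

```python
# Minimal sketch of a meaning-based abuse filter using sentence embeddings.
# The model name, seed phrases and threshold below are illustrative
# assumptions, not taken from any real deployment.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical seed examples of abusive messages; a real filter would use
# a curated (and possibly withheld) dataset.
abusive_examples = [
    "go back to where you came from",
    "people like you don't belong here",
]
abusive_embeddings = model.encode(abusive_examples, convert_to_tensor=True)

def is_abusive(message: str, threshold: float = 0.6) -> bool:
    """Flag a message whose meaning is close to any known abusive example,
    regardless of the exact wording used."""
    embedding = model.encode(message, convert_to_tensor=True)
    similarity = util.cos_sim(embedding, abusive_embeddings)
    return bool(similarity.max() >= threshold)

# A re-worded variant of a seed phrase should still be flagged.
print(is_abusive("You lot should return to wherever you crawled out of"))
```

Because the match is on embedding similarity rather than on exact keywords, trivially re-worded messages are still caught, which is the opposite of relying on obscurity.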
Fine, release the model as Open Weights, and everything around it as Open Source. Nobody says you have to release it as Open Source, and you’ve presented good reason not to. Indeed, the costs of releasing the data in this instance may exceed the benefits, much like publishing bomb-making instructions, which nobody is arguing should be Open Source.
In any case, the primary argument given for releasing as Open Source was that they’ve “historically operated as an Open Source organization”. They even proposed a simple solution that would allow them to release the model as Open Source even under a more ambitious and less contentious definition:
For our project in particular, we are considering the option of staggered data release- older data is released under an open data license, while the newest data requires users to request access.
Once again the (implicit) claim is not in the citation given.