It is true there is always the solution you recommend, which is to tell everyone (my paraphrase):
As with Open Source, the Open Source AI definition is narrow and specific to only those things which are fully Open, to include Open Data, Open Weights, Open Configs, and Open Source software.
If you do not release one of those components as Open, the whole thing cannot be called Open Source AI.
Thus, a model without an Open Data dataset can only be called “a model made of Open Source, Open Weights, and Open Configs but without the Open Data.”
This is the scenario where the hammer swing smashes without nuance anything that might be Open-enough. The other hammer swing is the one smashing all unavailable datasets as “good enough” regardless of nuance.
But darn it, then you pondered this:
I just don’t understand why we want to spread the use of the words Open Source so much.
Because the world is not a binary place and a hammer may not be the tool for this situation. At the very least, the fact that friends and colleagues are at loggerheads over the situation indicates there is something here worth slowing down for and thinking about more deeply before we continue.
My concern is the spreading of accurate Open practices wherever possible as part of a public good, because unintended benefits are proven to flow in Open Source ecosystems that are properly-licensed and permission-cultured — both license and vibe matter, right?
As we work to create appropriate boundaries, we want to be sure that good science, art, philosophy, et al. are not excluded accidentally in our focus on machine reproducibility. What concerns me are the unintended consequences of swinging this hammer on behalf of the binary positions my post is in response to.
That concern is sharpest when we make decisions with machine reproducibility at the center, when we can’t even guarantee that reproducibility across two different batches of the same-numbered chipsets.
I have given just two examples – lifesaving medical data that is ethically sourced (natch) and thus protected from release, and different country laws crossed with where people reside in the world – but there are more.
What is the way to extend the world of Open Collaboration to these two examples? Is it the long list of confusing terms that people will inevitably shorten to “Open Source AI,” so that we’ll be having this circa-2009 argument again, but with literally a million times more people involved?
Because that is one of the things at stake right now — setting aside any hidden agendas around dataset Openness — these topics we always knew were meaningful for humanity are becoming meaningful to humanity.
When openwashing is the topic of New York Times and Forbes articles, and they’re written fairly accurately, I mean … that is the sign people are becoming aware of Open, all those voices we’ve been wanting, yearning to have at the discussion and decision tables for years.
The stakes for this work are higher than I can perceive alone, but my sense is there is nuance to resolve in the Open Data discussion. What tools besides a hammer can we bring to the problem? What happens if we pull on the thread of this pattern I’ve identified here?
Are there people who understand issues around Open Data and who don’t know we’re at these loggerheads? Let’s go find them and ask for help. Let’s get the story off the news page and into podcasts and social media as a search for help.
These folks may have answers, they may have more use cases to solve for, they may even have someone capable of testing the value in D- cases, as in “An open call to test OpenVLA,” so we can see if the current words around training datasets are effective in practice.