I’m sharing The Missing Third Leg: Training Data Excluded from Open Source AI Definition by [Co-]Design with the community because it refers to the community, but that’s all for now. Sam
I’ve downloaded and transcribed all 18 of the Open Source Initiative (OSI) town halls on the subject of their Open Source AI Definition (OSAID) sideshow so you don’t have to. I used one of OpenAI’s Whisper models (large-v2), whose “performance is close to that of professional human transcribers” (per the paper released with it). “Whisper’s code and model weights are released under the MIT License.” OpenAI also released a model card and a Colab example, wrapping it all up in a nice blog post, Introducing Whisper, checking several boxes on the proposed checklist.
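For anyone wanting to reproduce the transcripts, the openly licensed model makes this a few lines of Python with the openai-whisper package. A minimal sketch follows; the file name is illustrative, and my actual scripts in the repo may differ:

```python
# pip install -U openai-whisper
import whisper

# Load the same openly licensed (MIT) model used for the transcripts.
model = whisper.load_model("large-v2")

# Transcribe a recording of one of the town halls (illustrative file name).
result = model.transcribe("town_hall_01.mp3")
print(result["text"])
```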
While we’ve been granted access to the Code and Model under the OSI-approved MIT license, the Data (and the related code used to collate and massage it into a useful form for training) is conspicuously absent, beyond a list of links to third-party data sources and a single file containing subtitles for a few dozen segments of YouTube videos of the Late Show with Stephen Colbert (which seems like more of a troll than a contribution, given it’s such a tiny fraction of the claimed-but-impossible-to-verify 680,000 hours of training data!).
Predictably, users are unable to fully and freely exercise two of the four essential freedoms of Open Source (to Use, Study, Modify, and Share), which were happily determined from the outset to be the same for AI; they can use and share it (already a boon for app developers, granted), but attempts to study and modify it have resulted in no end of confusion and dead ends.
Open Source AI Lite
Whisper is exactly the kind of Open Source-ish AI the OSI seeks to certify with the upcoming release candidate due to launch next month. While it partially protects your freedom to use and share it (provided you’re willing to do either without knowing or being able to verify its provenance), it does virtually nothing to protect your freedom to study and modify it (beyond very limited fine-tuning). It’s more akin to the Lesser (formerly Library) GPL (LGPL), which was also created for pragmatic adoption reasons similar to those being touted today (but at least then the software itself was Open Source, with an exception allowing it to link to proprietary code, rather than vice versa).
If excluding training data really is deemed non-negotiable, then the other simple solution to the problem is branding that acknowledges this limitation, similar to Karsten Wade’s “D+/D-” proposal:
- Open Source AI Lite
- Limited Open Source AI
- Library Open Source AI
- Lesser Open Source AI
- Qualified Open Source AI
- etc.
Co-Design? Consensus? Democracy? Dictatorship.
The process was declared to be “co-design” from the first minute of the first meeting, apparently decided in the closed mailing-list phase last year when transparency was not a priority. It was “to come out of consensus”, but then they “asked the group to vote”, only to admit yesterday that “those results were never meant neither to be scientific, nor representative, nor democratic or anything like that”. No surprise, given that counting the votes without granting vote-nullifying superpowers to certain candidates gave the wrong answer (i.e., that training data should be required)!
Stop gaslighting the community about “co-design” being an inclusive, transparent, and auditable process when the single most important decision (whether or not to require training data) was apparently predetermined. Don’t endorse it on that basis, and don’t rely on it on that basis. Don’t rely on my opinions either; rather, refer to the repo and draw your own conclusions. While I entered this process with open eyes, at this point I’m starting to accept that it is the result of corporate capture, and more the voice of the corporate sponsors (the likes of Cisco and Google) than of the community.
Training Data Indecision
One of the first questions in the first meeting asked about training data (unfortunately, dissenting voices have a habit of not getting recorded), and the response already insinuated that it may be “sufficient to have a detailed description of what went into it”. We have not moved one millimeter from that position all year, despite clear and consistent protest against it.
I asked GPT-4o (gpt.py) to review the Whisper transcripts for the impact of including or excluding training data on the four essential freedoms, and specifically to surface evidence of predetermined positions, biases, etc. You can see the constant drumbeat of dissent from that first town hall through yesterday in its analysis below.
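The gist of gpt.py is simple enough; here’s a minimal sketch using the openai Python client, with an illustrative prompt and directory layout rather than a verbatim copy of the script in the repo:

```python
# pip install openai  (assumes OPENAI_API_KEY is set in the environment)
from pathlib import Path

from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Review this town hall transcript for the impact of including or "
    "excluding training data on the four essential freedoms (Use, Study, "
    "Modify, Share), and surface any evidence of predetermined positions "
    "or biases."
)

# One Whisper transcript per town hall (illustrative path).
for transcript in sorted(Path("transcripts").glob("*.txt")):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": transcript.read_text()},
        ],
    )
    print(f"## {transcript.stem}\n{response.choices[0].message.content}\n")
```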
High Stakes, Low Expectations
If there is consensus on anything, it’s that the release candidate does not fully protect all four essential freedoms (to Use, Study, Modify, and Share), and arguably only partially protects two of them (to Use and Share). Apparently, no practitioners have even attempted to demonstrate that it does (per Model Weights is not enough for Open Source AI), and won’t until after it’s released, at which point it will be impossible to retract.
This means projects like the Personal Artificial Intelligence Operating System (pAI-OS), part of my work for Kwaai (a volunteer-based open source AI research and development lab) when I’m not studying for my master’s in CS/ML, will not have the same potent tool in the AI arena that other software projects have enjoyed for decades courtesy of the Open Source Definition (OSD), and will instead have to deal with competitors also quasi-legitimately claiming to be Open Source AI.
Worst of all, the proposed OSAID won’t do anything to help projects like Whisper share their work (they already do), nor encourage them to raise the bar by sharing data. At yesterday’s town hall I was asked why the first application to be built on pAI-OS decided to use Llama, and it’s the same reason I used GPT for this analysis: powerful Open Source Large Language Model (LLM) candidates, like that of Japan’s National Institute of Informatics (NII), are still a work in progress, despite claims to the contrary by Meta with their “Open Source AI” Llama (which, although the OSI agrees is not Open Source AI, is closer under the current definition than some are comfortable with). Hugging Face just announced there are more than 1,000,000 free (as in beer, not as in freedom) models on their platform too, but there will be less incentive to create more of them, and fewer opportunities to “stand on the shoulders of giants” by studying, modifying, and sharing existing models to do so.
The Path to a Meaningful Open Source AI Definition
I’ve been asked to contribute less, and less often, and I plan to accept their invitation, but we will continue holding their feet to the fire in the hope that common sense prevails, perhaps at the upcoming board vote (I’ve reached out to the board members individually and trust they are seeing these updates).
While OSI co-founder Bruce Perens is still around to talk about it (What comes after open source? Bruce Perens is working on it), the late, great Ian Murdock (the “ian” in Debian, whose Debian Free Software Guidelines formed the basis of the original definition of Open Source) is not. Perhaps this fits in with Bruce’s “Post-Open” pitch, but I wonder what Ian would have thought about it? After all, the OSI was founded to further the interests of business (and they’re certainly doing that here), but its roots are in Free Software and the Four Freedoms, which are essential for a reason, rather than merely “appreciated” like the training data in earlier drafts.
In other news, I gave a talk on Lies, Damned Lies, and Statistics: On the path to a meaningful Open Source AI Definition yesterday, and Kwaai have agreed to take on the topic in their policy track. While the Open Source Initiative may have been the ideal home for this conversation, it’s not the only one — may a thousand Open Source AI Definitions bloom!
Town Hall 01 - January 12, 2024
Overview
The first OSI town hall focused on establishing a framework for defining Open Source AI, emphasizing the need to include various stakeholders in the process. The meeting outlined the importance of aligning AI with open source principles, using the four essential freedoms as a guide. Discussions touched on the challenges of incorporating training data within these freedoms, highlighting the complexities and potential biases involved. The atmosphere was collaborative, but there were underlying tensions regarding data transparency and legal considerations.
Key Takeaways
- The need to define Open Source AI, not just machine learning.
- Incorporating four freedoms in the AI definition.
- Challenges in including training data within these freedoms.
- Stakeholder inclusion is crucial for the definition process.
- Legal frameworks and documents are key to granting freedoms.
Data Mentions
There were significant discussions on whether and how to include training data in the Open Source AI definition. Questions were raised about the level of access required to the training data, whether the full dataset or just a description would suffice. This was identified as a critical and delicate issue.
Quotes
- “If I like an AI system, I must be free to share it with other people.”
- “The question is what kind of access? What level of access? The full on training data set? Or is it sufficient to have a detailed description of what went into it?”
- “We cannot have an open source AI that will never be transparent, will never be explainable or fair.”
[refer to blog due to post size limits]