The Missing Third Leg: Training Data Excluded from Open Source AI Definition by [Co-]Design

I’m sharing The Missing Third Leg: Training Data Excluded from Open Source AI Definition by [Co-]Design with the community because it refers to the community, but that’s all for now. Sam

I’ve downloaded and transcribed all 18 of the Open Source Initiative’s (OSI) town halls on the subject of their Open Source AI Definition (OSAID) sideshow so you don’t have to. I used one of OpenAI’s Whisper models (large-v2), whose “performance is close to that of professional human transcribers” (per the paper released with it). “Whisper’s code and model weights are released under the MIT License.” OpenAI also released a model card and Colab example, wrapping it all up in a nice blog post, Introducing Whisper, checking several boxes on the proposed checklist.
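
For anyone who wants to reproduce (or verify!) the transcripts, the pipeline is simple enough; here’s a minimal sketch using the openai-whisper package, with file and directory names that are illustrative rather than my exact setup:

```python
# Minimal sketch of the transcription step using the openai-whisper package.
# Directory and file names are illustrative, not the exact pipeline used here.
from pathlib import Path

import whisper

model = whisper.load_model("large-v2")  # MIT-licensed weights, fetched on first use

Path("transcripts").mkdir(exist_ok=True)
for audio in sorted(Path("townhalls").glob("*.mp3")):
    result = model.transcribe(str(audio))
    (Path("transcripts") / f"{audio.stem}.txt").write_text(result["text"])
    print(f"transcribed {audio.name}")
```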

Despite users having been granted access to the code and model under the OSI-approved MIT license, the data (and the related code used to collate and massage it into a useful form for training) was conspicuously absent, beyond a list of links to third-party data sources and a single file containing subtitles for a few dozen segments of YouTube videos of the Late Show with Stephen Colbert (which seems like more of a troll given it’s such a tiny fraction of the claimed-but-impossible-to-verify 680,000 hours of training data!).

Predictably, users are unable to fully and freely exercise two of the four essential freedoms of Open Source (to Use, Study, Modify, and Share), which were happily determined from the outset to be the same for AI. They can use and share it — granted, already a boon for app developers — but attempts to study and modify it have resulted in no end of confusion and dead ends.

Open Source AI Lite

Whisper is exactly the kind of Open Source-ish AI the OSI seeks to certify with the upcoming release candidate due to be launched next month. While it partially protects your freedom to use and share it — provided you’re willing to do either without knowing, or being able to verify, its provenance — it does virtually nothing to protect your freedom to study and modify it (beyond very limited fine-tuning). It’s more akin to the Library (later Lesser) GPL (LGPL), also created for pragmatic adoption reasons similar to those being touted today (but at least then the software itself was Open Source, with an exception allowing linking to proprietary code, rather than vice versa).

If excluding training data really is deemed non-negotiable, then the other simple solution to the problem is branding that acknowledges this limitation, similar to Karsten Wade’s “D+/D-” proposal:

  • Open Source AI Lite
  • Limited Open Source AI
  • Library Open Source AI
  • Lesser Open Source AI
  • Qualified Open Source AI
  • etc.

Co-Design? Consensus? Democracy? Dictatorship.

The process was declared to be “co-design” from the first minute of the first meeting, apparently decided during the closed mailing list phase last year, when transparency was not a priority. It was “to come out of consensus”, but then they “asked the group to vote”, only to admit yesterday that “those results were never meant neither to be scientific, nor representative, nor democratic or anything like that” — no surprise when counting the votes without granting vote-nullifying superpowers to certain candidates gave the wrong answer (i.e., that training data should be required)!

Stop gaslighting the community about “co-design” being an inclusive, transparent, and auditable process when the single most important decision — whether or not to require training data — was apparently pre-determined. Don’t endorse it on that basis, and don’t rely on it on that basis. Don’t rely on my opinions either; refer to the repo and draw your own conclusions. While I entered this process with open eyes, at this point I am starting to accept that it is the result of corporate capture, and more the voice of the corporate sponsors (the likes of Cisco and Google) than of the community.

Training Data Indecision

One of the first questions in the first meeting asked about training data (unfortunately, dissenting voices have a habit of not getting recorded), and the response already insinuated that it may be “sufficient to have a detailed description of what went into it”. We have not moved one millimeter from that position all year, despite clear and consistent protest against it.

I asked GPT-4o (gpt.py) to review the Whisper transcripts for the impact of including or excluding training data on the four essential freedoms, and specifically to surface evidence of predetermined positions, biases, etc. You can see the constant drumbeat of dissent, from that first town hall through yesterday, in its analysis below.
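
For transparency, the analysis step boils down to something like the sketch below, assuming the official openai Python client; the prompt wording and file layout are illustrative, not a verbatim copy of gpt.py:

```python
# Illustrative sketch of the analysis step (the actual gpt.py differs in detail).
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Review this OSI town hall transcript for the impact of including or "
    "excluding training data on the four essential freedoms (use, study, "
    "modify, share), and surface any evidence of predetermined positions or biases."
)

for transcript in sorted(Path("transcripts").glob("*.txt")):
    # Long transcripts may need chunking to fit the model's context window.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": transcript.read_text()},
        ],
    )
    print(f"## {transcript.stem}\n\n{response.choices[0].message.content}\n")
```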

High Stakes, Low Expectations

If there is consensus on anything, it’s that the release candidate does not fully protect all four essential freedoms (to Use, Study, Modify, and Share), and arguably only partially protects two of them (to Use and Share). Apparently, no practitioners have even attempted to demonstrate that it does (per Model Weights is not enough for Open Source AI), and none will until after it’s released, when it will be impossible to retract.

This means projects like the Personal Artificial Intelligence Operating System (pAI-OS) — all this being part of my work for Kwaai (a volunteer-based open source AI research and development lab) when I’m not studying for my master’s in CS/ML — will not have the same potent tool in the AI arena that other software projects have enjoyed for decades courtesy of the Open Source Definition (OSD), and will instead have to deal with competitors also quasi-legitimately claiming to be Open Source AI.

Worst of all, the proposed OSAID won’t do anything to help projects like Whisper share their work — they already do — nor encourage them to raise the bar by sharing data. At yesterday’s town hall I was asked why the first application to be built on pAI-OS decided to use Llama, and it’s the same reason I used GPT for this analysis: powerful Open Source Large Language Model (LLM) candidates, like that of Japan’s National Institute of Informatics (NII), are still a work in progress, despite claims to the contrary by Meta with their “Open Source AI” Llama (which, although the OSI agrees is not Open Source AI, is closer under the current definition than some are comfortable with). Hugging Face just announced there are more than 1,000,000 free (as in beer, not as in freedom) models on their platform too, but there will be less incentive to create more of them, and fewer opportunities to “stand on the shoulders of giants” by studying, modifying, and sharing existing models to do so.

The Path to a Meaningful Open Source AI Definition

I’ve been asked to contribute less, and less often, and I plan to accept their invitation, but we will continue holding their feet to the fire in the hope that common sense prevails, perhaps at the upcoming board vote (I’ve reached out to them individually and trust they are seeing these updates).

While OSI co-founder Bruce Perens is still around to talk about it (What comes after open source? Bruce Perens is working on it), another co-founder, the late, great Ian Murdock (the “ian” in Debian, whose Debian Free Software Guidelines were the basis for the original definition of Open Source), is not. Perhaps this fits in with Bruce’s “Post-Open” pitch, but I wonder what Ian would think about it? After all, the OSI was founded to further the interests of business — and they’re certainly doing that here — but its roots are in Free Software and the Four Freedoms, which are essential for a reason, rather than merely “appreciated” like the training data in earlier drafts.

In other news, I gave a talk on Lies, Damned Lies, and Statistics: On the path to a meaningful Open Source AI Definition yesterday, and Kwaai have agreed to take on the topic in their policy track. While the Open Source Initiative may have been the ideal home for this conversation, it’s not the only one — may a thousand Open Source AI Definitions bloom!

Town Hall 01 - January 12, 2024

Overview:

The first OSI town hall focused on establishing a framework for defining Open Source AI, emphasizing the need to include various stakeholders in the process. The meeting outlined the importance of aligning AI with open source principles, using the four essential freedoms as a guide. Discussions touched on the challenges of incorporating training data within these freedoms, highlighting the complexities and potential biases involved. The atmosphere was collaborative, but there were underlying tensions regarding data transparency and legal considerations.

Key Takeaways

  • The need to define Open Source AI, not just machine learning.
  • Incorporating four freedoms in the AI definition.
  • Challenges in including training data within these freedoms.
  • Stakeholder inclusion is crucial for the definition process.
  • Legal frameworks and documents are key to granting freedoms.

Data Mentions

There were significant discussions on whether and how to include training data in the Open Source AI definition. Questions were raised about the level of access required to the training data, whether the full dataset or just a description would suffice. This was identified as a critical and delicate issue.

Quotes

  1. “If I like an AI system, I must be free to share it with other people.”
  2. “The question is what kind of access? What level of access? The full on training data set? Or is it sufficient to have a detailed description of what went into it?”
  3. “We cannot have an open source AI that will never be transparent, will never be explainable or fair.”

[refer to blog due to post size limits]

2 Likes

First, on a personal note:
Since the names Bruce and Ian were mentioned, it reminded me of the controversy 20 years ago regarding the Debian Common Core Alliance, as referenced in the URL above. The DCCA was essentially an attempt to create another Debian distribution, led by corporate interests. I clearly expressed my opposition to Ian, the founder of DCCA, and this developed into a controversy involving eWeek. At the time, my company received numerous inquiries from business partners, and dealing with them was very challenging. I consistently argued that the Debian community should remain united and that it was not good for corporate interests to drive division. In the end, DCCA ended up being rather inconsequential.

In Japan, I have been playing the role of a guardian of the term “Open Source” ever since the Open Source movement started in 1998. I have continuously provided Japanese translations and explanations of the OSD and OSI-approved licenses to make them easier for Japanese people to understand. Naturally, from this position, I do not favor community division. I want to make that clear.

On the Co-Design process:

Regarding the Co-Design process, I believe Samj-san is aware that Stefan-san has already stated that it was neither democratic nor transparent. I take the OSI director’s candid statements at face value. As a member of the community, I want to devote myself to correcting OSAID in a better direction.

On dataset completeness:

Putting aside the discussion on the naming, I am starting to think that a two-tier branding system, similar to “GPL/LGPL” or “D+/D-,” is necessary.

We don’t know how important data will become in the future, but at present, everyone participating in this process agrees that data is crucial for research. The issue is its completeness. The datasets and code released by Japan’s National Institute of Informatics (NII), which I have already cited as an example, are expected to serve as the foundation for many AI models developed by Japanese companies in the future. I am sure that, before long, they will release free datasets capable of enabling GPT-4 class models.

Derivative models created based on NII’s work will likely be published under Open Source licenses. It doesn’t seem very fair to me that these NII-derived models and open-weight models, which only disclose the minimum necessary data sources, can both use the same “Open Source AI” branding.

A few months ago, while preparing materials to explain OSAID, I realized that “dataset completeness” has layers, so I created a diagram. Recently, I noticed that it resembles the “D+/D-” discussion, so I remade it earlier, incorporating the “D+/D-” idea. There are some differences from what is being discussed here, but it might help to organize our thinking a bit.

4 Likes

Hi @shujisado,

While I agree this might be a possible solution, license-wise, it still misses a few considerations (which are also missing in almost all conversations elsewhere, including the EU and US legislative discussions): different types of data, versioning, and preservation.

Types of data:

  • Train Data
  • Test Data
  • Weights/parameters - the values that were obtained from training the system
  • Model Metadata - the ML design which is the core of the system [note: this should be considered a new type of code, not data]
  • System Metadata - the precise description of the physical environment used to train the system, including processor (type and variant) and hardware
  • User Data - the data supplied by the user on their queries, including the queries themselves

Versioning and Preservation: all data must be kept in a versioned dataset repository so its reuse can be tracked precisely.
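
As a rough illustration of the minimum viable approach (dedicated tools like DVC or git-annex do this properly), a content-addressed snapshot is enough to pin reuse to an exact version; the paths here are illustrative:

```python
# Sketch: record a content-addressed snapshot of a dataset directory so later
# reuse can be pinned to an exact version. Paths are illustrative.
import hashlib
import json
from pathlib import Path

def snapshot(dataset_dir: str, manifest_path: str) -> None:
    entries = {}
    for f in sorted(Path(dataset_dir).rglob("*")):
        if f.is_file():
            entries[f.as_posix()] = hashlib.sha256(f.read_bytes()).hexdigest()
    # The manifest's own hash can then serve as the dataset version identifier.
    Path(manifest_path).write_text(json.dumps(entries, indent=2, sort_keys=True))

snapshot("train_data", "train_data.manifest.json")
```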

I’ll be defining these terms more fully in another post in a few days.

Adding fuel to the fire, I must note that defining OSAI requires three separate documents, which work together and must be read as one single document:

While I think the first is mostly correct, I would still remove the “Preferred form to make modifications to machine-learning systems” section, which, as an explanation, belongs in the OSAID FAQ, and change “Open Source models and Open Source weights” to be in tune with current AI terminology (OECD, OSI, EU-US treaty, IEEE, etc.).

The Checklist misses a few considerations about data, as I explained before, to which I’ll be proposing a change.

The FAQ is the troublesome document, as it contains some very wild assumptions that lack the context needed to understand them.

I’ll consolidate my reply and keep it brief:

Agreed, though to quote (without attribution) a response I received privately, with which I also agree:

The OSAID, if released as is in the current state, would become a milestone where Free/Open depart from each other in the era of AI software.

I can’t control the final decision — apparently none of us can — but we can surface information relevant to making it. I assume good faith by default and trust the best decision will be made with the information available at the time. I hope a fork isn’t forced on us, but fear that’s what’s ahead on the current path — Debian, for example, will not ship anything that doesn’t comply with the DFSG.

Absolutely, or I wouldn’t be here.

If the default definition can’t guarantee that it protects the four freedoms, then the only alternative is two-tier branding.

No, but deliberately applying futures thinking is necessary to avoid drafting a multi-decade definition that quickly proves short-sighted. A generous beta testing period for an experimental release candidate would be instructive here. There will only ever be more data available for training (barring changes in law), and with strong incentives like meaningful Open Source AI certification we could have a spectrum of useful datasets, models (like NII’s LLM), and systems within a year or two.

Clearly not, and this is the Open Source Initiative, not the Open Weights Initiative.

Your diagram resembles the one I pitched to OCI 15 years ago in 2009 to “look at the possibility of bringing the OCI/OCP in under the OSI umbrella to save duplication of effort” (OCI/OCP being the Open Cloud Initiative and Open Cloud Principles, modeled on the OSI/OSD):

I abandoned this approach because I believe less is more and that this should be a binary classification, as has generally been the case to date. In any case, free (as in freedom) cloud (and AI) is necessarily not free (as in beer) in a world of services that cost billions to deliver, rather than products for which the marginal cost is zero.

A purist approach here (like being purist about open data licenses) is going to prove too restrictive.

“The Common Crawl corpus contains petabytes of data, regularly collected since 2008”, but not every interesting dataset is going to have its own foundation and enjoy free AWS hosting.

If the data your Open Source AI depends on is no longer available for whatever reason then it would immediately and automatically drop out of compliance, in the same way that your Linux distribution would if the FTP server hosting its source fell over.

That’s why any solution I build for this problem will include a machine-readable manifest (i.e., checklist) and services for verifying ongoing compliance (availability, hashes, etc.).
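
As a rough sketch of what such a verification service might check (the manifest format here is hypothetical, not a proposed standard):

```python
# Hypothetical manifest format: {"artifacts": [{"url": "...", "sha256": "..."}]}
import hashlib
import json
import urllib.request

def verify(manifest_path: str) -> bool:
    with open(manifest_path) as f:
        manifest = json.load(f)
    compliant = True
    for artifact in manifest["artifacts"]:
        try:
            with urllib.request.urlopen(artifact["url"]) as resp:  # availability
                digest = hashlib.sha256(resp.read()).hexdigest()   # integrity
        except OSError:
            print(f"UNAVAILABLE  {artifact['url']}")
            compliant = False
            continue
        if digest != artifact["sha256"]:
            print(f"HASH MISMATCH {artifact['url']}")
            compliant = False
    return compliant
```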

This seems unacceptable — if they must be read as a single document they should be a single document, like every other “license” (including the OSD & DFSG).

To @nick’s suggestion above about borrowing IETF requirement levels (whether or not we state it explicitly), they also use Normative and Informative References. While a checklist could — but probably should not — be normative, an FAQ is clearly informative, in the same way that the rationales in the annotated OSD are not to be relied upon (I believe lawyers refer to these as ‘recitals’, ‘preambles’, etc., but IANAL).

Thank you for doing this merging/comparing of ideas — and I <3 @samj with the OCI comparison; yup, this problem has been around for a while, but there never was much attention on it until the “discovery” of Big Data…

One nuance I see in these patterns highlights the conflicting challenges we have:

  • OSAI components are licensed in layers, especially the data (explanation to follow)
  • Adherence to strictness of definition for each layer, every single time, versus…
  • Vernacular usage that will continue to ignore cases where one of the deeper layers is non-Open

Models often have a complexity of licensing when you get inside them that depends on the specific objects in the original dataset — the JPEG of an ancient porcelain vase used in a dataset may be under the copyright of a museum that didn’t license the image … except for the acceptable use policy on the museum website that was in effect in 2019 when Common Crawl came through. And now there are two blocks to the data objects being actually Open: the lack of a license from the museum and the terms of use for Common Crawl.

Here is the just-released specific example I’m thinking of, which claims to be Open Source but only is for the part the creators could license:

Researchers from ByteDance have released a multimodal dataset designed for complex mathematical reasoning (article). A glance at the arXiv paper InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning and a read of the model card on Hugging Face quickly reveal these layers:

  1. The InfiMM-WebMath-40B multimodal dataset is under the odc-by (Open Data Commons - Attribution license), which is OSD-compliant.
  2. However, the actual documents used — 85 million image URLs and 24 million web documents — came from Common Crawl data snapshots from 2019-2023 (with filtering, etc.). Common Crawl data seems to be under their terms of use, which is definitely not OSD-compliant.

Another simple example would be a model based on geospatial datasets where the actual satellite images were not available under an OSD-compliant license, but everything else is.

In this example, a model creator could release all the code, weights, other metadata and filters and configurations, but the actual JPEGs would not be re-distributable. Yet a person wanting to recreate this model could obtain the images themselves directly from the source, which might even be at no-cost but also come without a license to redistribute.

This last example is based on one brought to me recently and frankly, I’m stumped. The model creator can be the best Open Source actor in the world, ready to collaborate on everything, but since they don’t operate the satellite and have to source the images from entities with different needs and rules, what can they do to participate in the Open Source ecosystem?

If I release an otherwise OSAID-with-Open-Data-compliant model under these conditions, and I put in the README the exact steps to download the images from a .gov FTP server at zero additional cost, where do I land in this Open Source AI ecosystem?

Completely setting aside nefarious actors, it seems that having the dual-branding (like D+/D-) gives a way for model creators from research, academia, NGOs, and startups to participate in the OSAI ecosystem on a more equal footing, yes?

1 Like

The case of the Whisper large-v2 model is exactly what I worried about when proposing the ToxicCandy term in ML-Policy. Speaking from the academic side, in order to make progress in scientific research, what usually happens is that you reproduce a piece of work from others and check whether the outcome matches expectations (in the Whisper case, at least that the performance metrics match what has been reported in their papers/tech reports). Then you make whatever modification you want, to see whether your own version of large-v2 is better.

Controlled experiments are important. If you use the same training code plus your own modification, but different data, to obtain a performance improvement over large-v2, you cannot claim “your own modification” is an improvement; it may simply be contributed by the data.

Without training data, I see no freedom to reproduce the original work, and no chance to make improvements to it. What I see is a plausible “open source” with a dataset barrier protecting a monopoly, where the whole community around it can only fork the model while being fully controlled by capital. Once the company behind it goes down, nobody else in the world would be able to reproduce this work in order to continue maintaining it as a piece of AI software. This is not what a free software community looks like.

I respect OSI’s hard work on converging on something practical, but I still insist on my personal opinion that the current state is not a good idea in the long run.

2 Likes

Yes. Per my reply to @Shamar, the precedent is for the licence that is limited for pragmatic reasons (e.g., LGPL) to adopt the alternative branding while still protecting the four freedoms. Alternatively, any compromise could be temporary under a single brand, but it may be difficult or impossible to revoke later.

“The proposal is that we acknowledge that taking a purist approach to data (e.g., demanding open data licenses) will drastically limit the number of candidates for certification”, to your points above, rather “requiring data (analogous to the proprietary code [under the LGPL precedent]) be accessible when building (training) the Open Source licensed redistributable software (model), in order to protect the four freedoms”.

Freeware (free as in beer, not as in freedom) is what we used to call software that was made available in binary-only form to use and share but not study or modify, as you know, and this is one of the main reasons people are reluctant (or prohibited by policies) to rely on it. The current draft is basically freeware for AI, and this impinges on the freedom to even use the software.

No, it’s not, which is why I consider this a train worth throwing ourselves in front of.

1 Like