Recognising Open Source "Components" of an AI System

From a practitioner's perspective, it would be helpful to know how to describe a system that fails to meet the specification of Open Source AI (assuming, for the purpose of this discussion, that the components consist of Source Code, Model, and Data) but still has open source components.

If, let's say, the model weights and code of an AI system are made available under open source terms, I think it would be beneficial for the licensor to be able to signal that clearly. The RAIL licenses do something similar, but with a view to responsible-use clauses, and allow licensors to specify which components are subject to usage restrictions.

[Posting image below from the RAIL blog]


While an Open Source AI system may require all of its components to be available under open source terms, restrictions on how models are developed or data is sourced may mean there are very few real-world Open Source AI systems.

Would it be good for the Open Source AI definition to support modular definitions of Open Source AI components, and license terms for them?

Many AI systems with usage restrictions in fact apply those restrictions to models and data only.

Three diverse examples (a rough machine-readable sketch of per-component licensing follows the list):

(i) BigScience BLOOM → open source licenses for code, RAIL license for model weights
(ii) OLMo Model (OLMo - Open Language Model by AI2) → RAIL-like license for data (the Dolma dataset), open source license for code and model weights
(iii) Aya Model (Aya | Cohere For AI) → Everything looks to be under an Apache 2.0 license, but they call it open-access rather than open-source. (I haven’t studied it in detail.)
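
To make the per-component signalling concrete, here is a minimal sketch of what machine-readable license metadata for the first two examples could look like. The schema and field names are invented for illustration, and the exact SPDX identifiers are assumptions; the values just follow the descriptions above.

```python
# Hypothetical sketch: per-component license metadata, so each artifact
# of an AI system can signal its own terms. Schema and field names are
# invented; exact license identifiers are assumed from the list above.
releases = {
    "BigScience BLOOM": {
        "code": "Apache-2.0",                     # "open source licenses for code" (exact id assumed)
        "model_weights": "BigScience RAIL v1.0",  # usage restrictions
    },
    "OLMo": {
        "code": "Apache-2.0",                     # open source (exact id assumed)
        "model_weights": "Apache-2.0",            # open source (exact id assumed)
        "data": "AI2 ImpACT (Dolma)",             # RAIL-like terms
    },
}

# Component-level questions ("under what terms are the weights released?")
# can then be answered independently of the system-level verdict.
print(releases["BigScience BLOOM"]["model_weights"])  # BigScience RAIL v1.0
```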

1 Like

I completely agree with this, but the model and the weights are the same thing.

Obviously, for a larger system to be open source, all of its components need to be. That said, I don’t think an open source program which uses weights under an open source license should have to include training code, for example.

I think this creates no ambiguity at all.

Those training/inference programs are clearly open source, while the models being distributed are clearly proprietary (unless it’s completely clear that copyright doesn’t apply at all and one can get them without signing any contract).

1 Like

It doesn’t create ambiguity but perhaps we should create OS definitions for components and recognize them as individual artifacts explicitly.

We are in the process of defining what an Open Source AI system is (which we assume subsumes model weights); but what if someone releases only the model weights under an Apache 2.0 license? The current draft will not let it be “called” anything. What is it? Is it an Open Source Model? Yes? No?

We should make that explicit (like what RAIL does with DAMS). I don't think the all-or-nothing approach to Open Source AI is the way to go either (it's a bit too rigid).

PS: I’m using “model weights” as a noun phrase :smiley:

1 Like

Personally, I’d argue that a model would be open source if it’s in a format that can be processed with free and open source software. So no DRM, no proprietary formats or anything of the sort.
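
As a minimal sketch of that criterion (the file name here is hypothetical): weights stored in an open, documented container can be inspected end-to-end with free and open source tooling.

```python
# Sketch of the "processable with FOSS" criterion: weights stored in an
# open, documented container (NumPy's .npz, a plain zip of arrays) are
# readable by any open source stack, with no DRM or proprietary formats.
# "model_weights.npz" is a hypothetical file name.
import numpy as np

with np.load("model_weights.npz") as weights:
    for name in weights.files:
        arr = weights[name]
        print(name, arr.shape, arr.dtype)
```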

The (Open)RAIL licenses are proprietary licenses, completely outside the scope of open source anything. Part of the reason for defining different assets is that different licenses place proprietary restrictions on different kinds of assets, and many wish to have a combination of open and proprietary assets (such as releasing code as open source, but models under a RAIL license).

I don’t think anyone would argue that training code can’t be open source if you don’t provide a trained model with it. Likewise, inference code, by itself, may very well be open source without weights. In my view at least, it’s also the case that a trained model should be able to qualify as “open source” without either, because neither is part of the same asset.

But anything which is made of multiple components should absolutely only be classified as “open source” if all of those components are open source.
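
A toy encoding of that rule, just to pin the logic down (the license set below is a placeholder, not an authoritative list):

```python
# Toy encoding of the rule above: a composite work is "open source" only
# if every one of its components is. OPEN_SOURCE_LICENSES is a placeholder
# set of SPDX identifiers, not an authoritative list.
OPEN_SOURCE_LICENSES = {"Apache-2.0", "MIT", "BSD-3-Clause", "GPL-3.0-or-later"}

def composite_is_open_source(component_licenses: dict[str, str]) -> bool:
    """True only when every component carries an open source license."""
    return all(
        license_id in OPEN_SOURCE_LICENSES
        for license_id in component_licenses.values()
    )

print(composite_is_open_source({"code": "Apache-2.0", "weights": "MIT"}))     # True
print(composite_is_open_source({"code": "Apache-2.0", "weights": "RAIL-M"}))  # False
```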

2 Likes

Yes, and I’m saying that instead of letting people interpret whether it’s Open Source or not, the standard should make that clear.

Right now, the scope of the standard is to define the “full” system without stating what it means for a model to be “open”-source(?) or for the dataset of an Open Source AI system to be “open”-source(?).

If it’s always “everything” or “nothing”, then I worry we don’t gain much by having the definition of “Open Source AI”. It will only serve the purpose of making clear that nothing else is Open Source AI, and there will be very, very few examples of “true” open source AI.

I don’t think anyone would argue that training code can’t be open source if you don’t provide a trained model with it. Likewise, inference code, by itself, may very well be open source without weights. In my view at least, it’s also the case that a trained model should be able to qualify as “open source” without either, because neither is part of the same asset.

But anything which is made of multiple components should absolutely only be classified as “open source” if all of those components are open source.

+1, but should the standard be silent about this? That is my question.

What you are saying is true for all kinds of assets, especially software. A program contains many modules. Some can be open source while others are proprietary. But all of them have to be open source for the program, as a whole, to be.

Why? If you have an open source program which contains an open source model, the whole system is open source.

This seems to be a variant on a criticism of this effort that a number of us have been voicing, and I’ve also heard it from people not participating directly in this effort but who are following it with interest. The direction this effort seems to be going in is hopelessly impractical, in that no interesting “AI systems” will ever be able to meet the high bar that is being set. (Maybe I’m misunderstanding that direction, but it’s the sense I am getting.) It’s not that the approach is fundamentally wrong as a philosophical matter, but that it is utopian in a way that traditional pragmatic definitions of free software/open source are not.

If we want to see something approximating traditional libre software in the machine learning space, let’s focus on the licenses covering what is released, as I believe that is where we can practically have some influence.

The main problem we face is very simple: machine learning practitioners, and their employers, are using “open source” to mean “public”. The OSI should focus on combatting that problem, and not try to solve a larger problem that is intractable.

Let me set the record straight: there is no bar set, yet. I’ll voice this more clearly and publicly with the next draft, and I’ll get the board to say something official, too. My recommendation to the board will be to reinforce that the organization is not going to accept a definition that contains a null set of AI systems, or one that is so totally impractical that only a handful of small, toy, demo AI systems will fit in it.

We tried that approach and it failed rapidly, devolving into a massive number of objections and responses that included “well, it depends on…”, followed by long arguments about data, models, code, components, etc.

As we learned from telling Meta that their model is not “open source”, OSI can’t be effective without a clear definition of what Open Source AI means. We need to clear this hurdle first.

1 Like

Thank you @stefano, that is very reassuring since that is a key issue that a number of people both participating here and elsewhere seem to be concerned about.

2 Likes

Thinking about this some more: I would say a “null set” OSAID might be justifiable if there simply were no sets of model artifacts being released under OSD/FSD-conformant licensing terms (let’s leave aside what “OSD/FSD-conformant” ought to mean in that context). My comment was focused on the kinds of materials typically not being made “open” today (particularly training data). A desire for adoption of OSAID is not worth total abandonment or adulteration of the OSI’s historical principles.

I worked on this concept, modeled after Creative Commons-type labels, as a way to easily and quickly communicate what components / reproducibility are available: Enhancing AI Transparency: Understanding Open Model Labels - AI Models

I wouldn’t say it is complete (there are still maybe a few more open questions; open ‘outputs’ still seem to be discussed a lot, for example Coqui XTTS, which had a non-commercial clause).
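
To illustrate the label idea (the component codes and syntax below are invented for illustration and are not the actual Open Model Labels scheme):

```python
# Hypothetical Creative Commons-style label composer: one short tag per
# openly released component. Codes and syntax are invented and are NOT
# the actual Open Model Labels scheme.
def compose_label(open_components: set[str]) -> str:
    codes = {"code": "C", "data": "D", "outputs": "O", "weights": "W"}
    tags = [codes[c] for c in sorted(codes) if c in open_components]
    return "OM-" + "+".join(tags) if tags else "OM-none"

print(compose_label({"code", "weights"}))  # OM-C+W
```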

If we put reproducibility first, many other parts start to fall into place. The ideas in https://lfaidata.foundation/blog/2024/02/14/adapting-the-definition-of-open-source-to-ai-the-quest-for-the-holy-grail/ delve deeper along these lines.

‘Open weights’ is already starting to be used (in the NTIA RFC, for example), so these ideas are moving into popular use, which further demonstrates the need for a component-based approach.

1 Like