The OECD definition classifies a wide range of reasoning systems as AI. It seems to me that this covers far more than the systems generally referred to as LLMs or diffusion models, and that many of the simpler systems are simply programs. In other words, there is a concern that it will overlap with the areas already addressed by the OSD, yet the article is vague on the relationship between the OSI’s OSD and this OSAID.
I think this is a big question that deserves to be discussed widely here. What do you think?
I believe the OECD definition is both too broad and too vague and is the biggest flaw of the current draft. One reason why it is problematic is well captured in the comment from Shuji Sado. The only reason for a special definition of “open source AI” is if the things being called open source AI have characteristics that differ from other software; otherwise, the OSI should simply undertake a general revision of the OSD if they think it is no longer suitable. The OECD definition could easily cover the kinds of conventional software systems that we have assumed can be measured solely against the OSD (or, equivalently, the Free Software Definition) to determine whether they are FOSS.
As far as I can tell, the only technical characteristic we see in the space now commonly being described as “AI” that makes it different from the conventional software for which the OSD has been seen as suitable is the inclusion of model weights as the core element of such systems (if “system” is even an appropriate word here). This is a critically different feature and may indeed justify a new or specialized “open source” definition (although I’m actually undecided on that). But so-called “AI systems” in general, using the OECD definition, might not be based on the machine learning paradigm at all. Why is a new definition needed for such systems?
For example, consider two public projects that aim to implement a spam classification system. One uses machine learning, one does not. The first includes model weights along with some reference implementation source code and documentation. The second includes source code and documentation under, let’s say, entirely OSD-conformant licenses. If the two systems do roughly the same thing, why would one be an “AI system” and not the other? If both are “AI systems”, is the second potentially no longer open source once the OSI adopts this definition? Or is it “open source software that does AI stuff” but not “open source AI”?
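To make the contrast concrete, here is a toy sketch in Python (entirely hypothetical; the corpus, labels and markers are made up, and the ML variant assumes scikit-learn) of two classifiers doing roughly the same job, one learned from data and one hand-written:

# Project A: machine-learning spam classifier (would ship trained weights plus this reference code)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

training_messages = ["free money, act now!", "meeting at 10am tomorrow"]   # toy corpus
training_labels = ["spam", "ham"]
vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(training_messages), training_labels)

def classify_ml(message):
    return model.predict(vectorizer.transform([message]))[0]   # "spam" or "ham"

# Project B: hand-written rule-based spam classifier (ships only source code)
SPAM_MARKERS = ["free money", "act now", "winner"]

def classify_rules(message):
    text = message.lower()
    return "spam" if any(marker in text for marker in SPAM_MARKERS) else "ham"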
I honestly do not understand why the fact that the OECD definition is currently being reused a lot by regulators means it should be embedded in an OSI definition. For one thing, I don’t see regulators as the primary audience or target of this effort. For another, the fact that a flawed definition is in use by regulators (think of how regulators have approached defining “open source” itself) should not mean that the OSI shouldn’t strive for something better, in particular something better grounded in technical reality and in the current situation that I believe is giving rise to this effort, which is entirely specific to machine learning.
It is interesting to note that the EU AI Act modifies its definition of an AI system a bit:
“‘AI system‘ is a machine-based system designed to operate with varying levels of autonomy and that may exhibit adaptiveness after deployment and that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments”
The issue is that according to the new wording, the AI systems covered by the definition are those that “may exhibit adaptiveness after deployment”.
In this context, adaptiveness means the ability of an AI system to self-learn or adapt from new data inputs after deployment - a feature intrinsic to some, but not all, AI systems, especially if we consider older rule-based systems.
This means that the definition has been altered to exempt rule-based (non-adaptive) systems from the scope of the AI Act.
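To illustrate the distinction in code (a hypothetical toy sketch, not taken from the Act or from any real system), a non-adaptive system’s behaviour is fixed when it ships, whereas an adaptive one changes its behaviour based on data it receives after deployment:

# Non-adaptive: the ranking rule is fixed at release time and never changes
def rank_fixed(items):
    return sorted(items)

# Adaptive: the system keeps adjusting its behaviour from inputs seen after deployment
class AdaptiveRanker:
    def __init__(self):
        self.clicks = {}                        # item -> clicks observed in production

    def record_click(self, item):               # new data arriving post-deployment
        self.clicks[item] = self.clicks.get(item, 0) + 1

    def rank(self, items):
        return sorted(items, key=lambda i: -self.clicks.get(i, 0))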
I am not married to the OECD definition, but we need to refer to a well-understood and agreed-upon definition of AI system in order to define “open source” for it. We tried discussing “open” without that reference framework and the discussion was going all over the place.
Also, I fail to see how we can keep ‘openness’ a binary attribute of a system without defining it. Other attempts to judge the openness of AI talk of ranges and I don’t think that’s an acceptable outcome for OSI. I wrote about this topic on
That said, I’m not sure we can judge the scenario raised by @fontana until we have a clearer picture of what the OS AI Definition looks like in practice. It may well be that in practice both pieces of software can be judged by the same parameters, or that the OSAID will also cover “regular” software… We don’t know yet.
“Artificial intelligence” is a very broad and vague phrase, even in the field of AI itself. It has been for decades and I don’t see that changing.
There are many programs which some would describe as “artificial intelligence” and some would not. The debate is mostly meaningless and won’t be settled by this definition, while a definition that does settle it would not align with how the phrase is used in practice.
I think that, if OSI gets this right, it won’t matter much if something is AI or not, because the OSD and the OSAID will be well-aligned enough that whenever it’s not clear which definition should apply, they are equivalent in practice.
I moved some of the posts in this thread to the Training data access thread where they’re more appropriate.
Thanks @jplorre, that’s an interesting approach; I’d be open to using the same wording. I’d love to hear more suggestions. @fontana have you seen more precise definitions of AI systems that we should use instead?
@stefano I do not think we need a definition of “AI system” to solve the problem here. I think focusing on this concept and how to define it is unnecessarily complicating the task. Where else has the OSI focused on “systems” and the boundaries that are implied by them? Historically the OSI has focused on things that are released. There are two questions: (1) are the terms sufficiently permissive? (2) are you getting enough stuff to exercise the freedoms associated with FOSS? Those questions arise in the machine learning context too. There’s no need to complicate this with definitions of “systems” that appear to be removed from technical reality.
Good question. For the FSD and OSD the target is the program or software, mentioned multiple times in the GNU Manifesto, whose meaning is (un)consciously recognized by practitioners. One could argue that those concepts are also quite vague in modern computing but let’s stick to AI.
I am still convinced that we should make the effort to have a clear target of what the freedoms refer to in the AI space. But I’d love to hear other opinions, too.
And I’d love to hear alternative proposals from you @fontana . The current OSAID starts by saying “this is what we’re talking about” (the AI system definition), followed by a condensed manifesto, then the freedoms, etc… Is your argument that we don’t need to define AI at all? How would you rephrase the manifesto?
It appears to me that the OSAID is targeting AI systems, like LLMs, that fall within the scope of deep learning in this figure.
However, the OECD definition of AI covers the scope of machine learning (or beyond) and is not restricted to being software. It is a very vague concept. I do not believe we are defining “open” in the realm of data mining or statistical algorithms, so I still think the OECD definition of AI is too broad.
There does not appear to be a definition of AI at this time that adequately describes that narrow scope.
However, since the OSAID’s checklist indicates that it covers systems that are commonly referred to as AI systems today, there may be no reason to provide a definition of AI. I’m not entirely sure, though.
I think it’s quite important to have a definition which would apply to all AI systems, including hardcoded symbolic rule systems. For these, however, the definition ought to be equivalent in practice to the OSD, so that consistency is preserved and the massive gray area between AI and not AI doesn’t lead to any gray area between what’s classified as “open source” and what isn’t.
One thought I had today about the problems of the “AI system” terminology (leaving aside the definition): one effect might be that publishers of machine learning models under non-OSD-conformant licenses would say “well, we have an open model, an open source AI model; we don’t claim that what we have is an ‘AI system’, so we don’t consider it to be subject to the criteria set forth by the OSI in the OSAID”.
In other words, the use of this “system” terminology is a complication that may have the effect of narrowing the perceived scope of what the OSAID covers. Is the thought that the ordinary OSD kicks in in cases where purportedly you don’t have a “system”? That relates to a question that has been raised by at least a few of us here, I think. What is the relationship between the OSD and the OSAID? When does one end and the other begin? I’m concerned the “system” concept creates a loophole of some sort that permits someone to call a model “open” because it is somehow beyond the scope of what the OSD is thought to cover but is also outside of this concept of a “system”.
That’s one of the reasons why we added the “AI system” to the debate. We were going around in circles discussing exactly this use case: the model weights are available with MIT/BSD-like terms so it’s open source, right? Except that the weights alone are not very useful: you need more, as the analyses of Llama2, Pythia, OpenCV and BLOOM reveal. Anchoring the working groups around the definition of AI system helped answer the question “what exactly do you need to run, study, modify, share [Llama2 | Pythia | OpenCV | BLOOM]?” Now we have a good, shared idea.
One artifact I’m producing for draft 0.0.6 is a diagram like the one below to summarize the findings of the workgroups. Draft 0.0.6 will start listing which components are necessary to qualify as Open Source AI: it will be one that shares training code, supporting libraries and model parameters under an OSD-compatible license, with the architecture still under a question mark. Solving that question mark will be the next exercise.
The model architecture is usually defined in code, which will probably overlap with the training, validation, and testing code. The training, validation, and testing scripts will typically share the same neural network instantiation code or instance.
An example train.py:
model = create_network()            # create a neural network instance (the architecture, with randomly initialized model weights/parameters)
for epoch in range(num_epochs):     # iterate over the dataset for X epochs
    for batch in dataset:
        train(model, batch)         # update the model parameters
An example validation/test script:
model = create_network()                                      # instantiate the architecture with random parameters
model.load_state_dict(torch.load(network_parameter_file))     # load the pre-trained weights
metrics = []
for batch in validation_dataset:
    prediction = model(batch)
    metrics.append(calculate_metrics(prediction, batch))      # such as accuracy
report(aggregate(metrics))                                    # aggregate the metrics and report them
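And a minimal sketch of what the shared create_network() might look like (hypothetical PyTorch-style code; the layer sizes are arbitrary), to show that the architecture is ordinary source code imported by both scripts above:

import torch.nn as nn

def create_network():
    # The architecture itself: plain source code shared by train.py and the test script.
    # The weights are randomly initialized here; the trained parameters live in a separate file.
    return nn.Sequential(
        nn.Linear(784, 128),
        nn.ReLU(),
        nn.Linear(128, 10),
    )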
My take is that they should be effectively equivalent whenever it isn’t clear which one should apply, or when both apply. Which is actually the case a lot of the time.
Consider handwritten rule-based symbolic systems. They are classical computer programs and they are AI systems.
They aren’t deep learning systems, of course, but this is the “Open Source AI Definition”, not the “Open Source Deep Learning Definition”, so it shouldn’t lead to some programs being open source according to one definition but not the other.
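As a toy illustration (hypothetical, not taken from any existing project), a hand-written forward-chaining rule engine is simultaneously an ordinary program and a textbook example of symbolic AI:

RULES = [
    ({"has_fever", "has_cough"}, "possible_flu"),
    ({"possible_flu", "short_of_breath"}, "see_doctor"),
]

def infer(facts):
    # Apply the rules until no new conclusions can be derived
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in RULES:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

print(infer({"has_fever", "has_cough"}))   # adds "possible_flu"

Judged purely as a release, this is nothing but source code, so the OSD already answers whether it is open source.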
A proprietary system can include open source components. That is actually the norm with non-copyleft licenses and it’s common with weak copyleft licenses.
But a program is, as a whole, open source if all of its components are. A program which contains a feature powered by an ML model (and therefore contains the model itself) should include the inference source code and the model itself under an open source license.
I think nothing should be required which isn’t included in the program in some form. Anything which is included must be included in source code form if it is code, or, in all other cases, in an unrestricted, DRM-free format which can be parsed with open source software.
Obviously any component which none of us is going to think about, because it is either weirdly specific or not part of any AI system which currently exists, must also be open source wherever it’s included for that system to be, as a whole, open source.
To evaluate whether a system is open source one should consider the components of that system.