Draft v.0.0.9 of the Open Source AI Definition is available for comments

In Japan, the term “skilled person” is often associated with patent law, referring to an “imaginary person with ordinary skill and knowledge in a specific technical field.” I have translated every version of OSAID into Japanese, and for the term “skilled person,” I chose a term that simply means a person skilled in the technology, rather than using the term derived from patent law. This was to avoid any futile discussions in Japan about what constitutes “ordinary skill and knowledge.”

To be honest, I wish the term were a bit clearer.

1 Like

What alternative term would you recommend? Does “practitioner” or “AI practitioner” look any better?

How about “any random person with a computer”? :grimacing:

No, but seriously… I think a key success factor for open source is that anybody can get started with it. I'm a case in point: I have a PhD in informatics, but I got there as an autodidact. My father got a PC into the house in 1985, I started out looking at BASIC code as a kid, and when the Web came around, it was "View Source". I wrote code as a teenager with extremely little training, and took just a few courses as part of my master's in theoretical astrophysics.

Apart from that, it has just been looking at other people's code, RTFMing and eventually working with better coders than myself. And yet, here I am. I would hate to see the bar set any higher; I think humanity would be poorer for it.

1 Like

How about “any random person with a computer”?

any random person assisted by AI :wink:

Or better yet:

Sufficiently detailed information about the data used to train the system, so that an AI system can autonomously recreate a substantially equivalent system using the same or similar data.

Eventually we’ll get there… :grinning:

1 Like

It’s not well-accepted that you can deterministically recreate “realistic” scaled models.

The second you are using non-deterministic hardware (e.g., NVIDIA cards), architectures without deterministic implementations, or distributed computing at the node level, you are going to have intrinsic randomness that no seed-setting can fix.

See, e.g., the various notes on here for even single node, single GPU issues:
https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
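
For readers who haven't tried it, below is a minimal sketch of the usual "best effort" recipe from those notes. It is illustrative only and assumes nothing beyond stock PyTorch; the point of this post is precisely that even this is not sufficient.

```python
import os
import random

import numpy as np
import torch

# Seed every RNG in sight and disable autotuning.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed by some cuBLAS ops
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
torch.backends.cudnn.benchmark = False

# Fail loudly instead of silently falling back to a non-deterministic kernel.
torch.use_deterministic_algorithms(True)

# Even so, ops listed in the linked notes have no deterministic CUDA
# implementation at all and will simply raise a RuntimeError here, and none of
# this addresses multi-GPU/multi-node reduction order or hardware faults.
```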

PS: this is all ignoring corruption that occurs. There's a reason that nvidia-smi reports the *Volatile Uncorr. ECC* field, and I've personally seen dozens of 8xH100 boxes spit out garbage tensors for hours/days before anything shows up in dmesg.

If you’re lucky, the backwards step/optimizer fails and you know to revert to a prior checkpoint, but it’s a fact that every big model out there contains corrupted forward step tensors that are truly random data.

2 Likes

As discussed, byte-for-byte reproducibility is neither an achievable nor necessarily useful goal.

Similarly, with open source software you're going to end up with binaries that have different checksums but perform the same way if you compile on different machines, with different flags, etc., so including random seeds ("Data Information") while not demanding the data itself risks missing the forest for the trees.

1 Like

It’s worth noting that several examples of existing “truly” Open Source AI models have already been mentioned including StarCoder2 based on The Stack v2 cited here and the Open Language Model (OLMo) based on AI2 Dolma cited here, so arguing that no models will meet the standard is not only not relevant — we need to set the litmus test based on existing requirements like “preferred form” rather than convenience — but not valid. In any case, it’s all the more reason to encourage the development of new Open Source AI models rather than accepting the clearly inadequate status quo.

Also note that the existence of these models satisfies the (questionable) board requirement that the OSAID “Provides real-life examples: The definition must include relevant examples of AI systems that comply with it at the time of approval, so cannot have an empty set.”

3 Likes

Hello

I made a number of comments on the following page:

Namely:

  • for coherence and clarity, always referring to “Open Source AI” (adding Artificial Intelligence in parenthesis to expand, and not the reverse)
  • reviewing the title also for coherence: The definition for “Open Source AI” OR The Open Source Definition for AI Systems
  • subtle changes in the first paragraph, add “at least” to “the same”
  • capitalize all words in “Open Source models and weights”
  • minor edits in the last section

Not sure I was supposed to comment there or here.

Best,

Yann

1 Like

Thanks @yannlechelle, comments directly on HackMD are fine. All are taken into account.

1 Like

Hi @mjbommar, @samj and all, I’m sorry for the delay but I’ve just got back access to my account after being silenced for the proposal to separate concerns between source data and processing information.

The idea was simply that, whenever we cannot require the training data to be made available, they should at least be available under the same terms that allowed their use in training in the first place, so that no legal issue arises from the requirement to distribute them.

But probably my English is worse than I suspect, because the thread was closed without any comment, apart from a few questions I'm not allowed to answer.

Race conditions are bugs, not features.
So are data corruptions in RAM.

While they might constitute a sustainable technical debt in some situations, they shouldn't be "normalized". In fact, a lot of work has been done to achieve determinism since the GTC 2019 talk Determinism in Deep Learning.

I’d argue that it’s always possible (accepting a performance toll) to get exact training reproducibility (on the same hardware) with proper design.

But let's suppose you face some technical limitation that affects all existing hardware and that something along the lines of "sort floats before adding them" cannot fix.
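
To make the "sort floats before adding them" point concrete for readers who haven't run into it: floating-point addition is not associative, so the same values accumulated in a different order generally give a slightly different result, and fixing a canonical order restores bitwise reproducibility. A toy sketch in plain Python, not a training-scale solution:

```python
import random

values = [random.uniform(-1e6, 1e6) for _ in range(100_000)]
shuffled = random.sample(values, len(values))  # same numbers, different order

# Different accumulation order: the sums typically differ in the last bits,
# because floating-point addition is not associative.
print(sum(values) == sum(shuffled))                   # usually False

# Fixing a canonical order (here: sorting) makes the reduction reproducible
# no matter how the inputs happened to arrive.
print(sum(sorted(values)) == sum(sorted(shuffled)))   # True
```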

The source of randomness can be identified and recorded, so that people can still study exactly how the training process produced the weights from the original data. At worst, you’d have to record the whole training process. Heavy, slow, expensive (without the proper environment) but not impossible.

Now the point is: why should you?

In several AI systems, you don't need to do much to achieve reproducibility (the training process is inherently reproducible), but in a few with huge social impact it's needed to avoid the open washing of black boxes.

Sure, as @samj pointed out, the open source definition does not mandate reproducible builds and it’s normal to get different binaries when compiling a project with different flags.

But such non-reproducible builds do not inhibit the freedom to study the software! As XZ Utils taught us, you can always inspect the binaries and check their correspondence with the intended code.

Instead, AFAIK, with some statistical AI systems (LLMs, ANNs, etc.) the only way to ensure that the training data declared by the developers are actually the ones used to compute the distributed weights is to replicate the process.

So the only way to prevent the Open Source AI Definition from becoming an Open Washing AI Definition, used to fool users and gain their trust while preventing them from really studying the actual system they are using, is to require such reproducibility (which, as said, might at worst amount to recording the whole training process).

I’m more than happy to learn about a different way to obtain the same guarantee about the completeness of the declared training data.

But an OSAID that can be used to open wash black boxes, negating the freedom to study the system and selling the freedom to fine-tune as the freedom to modify the system, would be inherently insecure and detrimental to users and researchers.

So I hope we can keep brainstorming, looking for a better definition that can really grant the freedoms it aims to grant, including the freedom to study and to modify the system.

3 Likes

Great, thanks Nick. Thanks for all the work!

I can’t comment directly on the hackmd but FWIW despite being on-and-off involved in the process over the past year or so (starting with a deep dive last year) and so by some measures a participant in the “co-design” process, I am not able to endorse v. 0.0.9.

The main weakness to me is the same one that’s been noted by @Shamar and others: the weak and underspecified notion of “data information”.

Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.

This formulation mirrors the language used in the EU AI Act, which notes that models classified as open source face the requirement to

“draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model”

It would be great if OSI took a firm stand on the issue of what exactly constitutes sufficient detail for a system to even qualify as open source in the first place. For my money, the requirement that a skilled person can recreate a substantially equivalent system doesn’t cut it. It is unspecific and hard to verify (what counts as ‘equivalent’? and how much wiggle room does ‘substantially’ provide?). It is also easy to bypass with synthetic data (cleverly engineered or not) that would allow model providers to obscure true sources and therefore evade scientific, legal and regulatory scrutiny.

The result of the current draft is that we have a spiderman-pointing-at-spiderman type situation: the AI Act speaks of sufficiently detailed and the OSI definition does the same. Our paper —identifying the risk of arriving at "a single pressure point that will be targeted by corporate lobbies"— is proving to be possibly more prophetic than I was hoping it would be.

[image: Spider-Man pointing at Spider-Man meme]

A reasonable question is: what would be better? The issue is far from simple and the devil is in the details (pun not intended but I’ll take it).

For what it’s worth, I think @samj 's contributions elsewhere on these forums offer a constructive way forward.

And also, as some have argued, one could always actually require data to be available (in versioned form) and accept that this means there will be few (but not none) truly open source generative AI models. I know that early in the process a choice was made not to do this; but I’ve also seen a lot of dissent on precisely that point in these forums and elsewhere.

2 Likes

I’ve provided comments privately but am going to try and help align perspectives if that’s possible. I’m reacting first to the section on “Data Information” which includes the statement, “so that a skilled person can recreate a substantially equivalent system using the same or similar data.” I hadn’t thought of open source as requiring reproducibility. In the open source software domain, I’ve considered copyleft as a licensing model that provides transparency of the inputs used to create the output and to generate reproducible builds. Permissive licenses which do not give you that capability in the license still license anyone the ability to reuse, modify, and improve.

Based on discussions I’ve been a part of, I keep hearing a similar spectrum for AI systems which ranges from open source (e.g. permissive) to open science (e.g. copyleft).

My view is that open source AI systems enable anyone to reuse and improve an AI system. On the other end of the spectrum are open science AI systems that enable anyone to reuse and improve an AI system while providing additional transparency and reproducibility. Reproducibility was never in the OSD, and there are millions of Apache, MIT, and BSD-licensed projects where the necessary artifacts to enable reproducibility are not required to be shared.

"Open source AI systems" should include the model architecture, model weights and parameters, source code required for the architecture and weights and parameters, configuration files, and (at least high-level) information about the data used to train the system. These artifacts should be individually or collectively published under licenses allowing any recipient, without restriction, to use, study, distribute, sell, copy, create derivative works of, and make modifications to the licensed artifacts or modified versions thereof. This is highly analogous to BSD/MIT/Apache-2.0 licensed source code that we all call "open source" every day of the week.

I will also note, we’re already seeing copyright trolls going after datasets - pinpointing datasets with specificity would open the floodgates. There are also unsettled laws and regulations that will have to deal with data disclosures and what is appropriate or not. I don’t see how OSI’s definition could “override” these macro-level factors.

You may not have all the scripts to reproduce a model, but under this approach what has been licensed authorizes recipients to do what they need to reuse and improve it. It's unclear why anyone would need more than the MOF Class III requirements to be "open source" if the expectation is to be focused on reusing and improving the system. Under "Code", the requirement to include code beyond the model architecture and model parameters doesn't match the reality of what people are already doing to reuse and improve AI systems - without all the other training and validation code, people are already doing useful things with "open" models. MOF Class III only requires the code necessary for the model architecture and model parameters. If the expectation is to be able to reproduce an AI system, that's where you get into open science.

In this spectrum, "open science AI systems" would be open source AI systems that additionally include licenses to all source code, training data, configuration files, research, and documentation required to reproduce a similar AI system without restriction. This is similar to what is expected of GPL-2.0 licensed source code, which when distributed as a binary requires the recipient to have access to the source code plus the scripts that are required to build and install it.

My concern is that the current draft incorporates terminology many in our AI communities associate with open science into the open source definition. There are two approaches to “open” AI and in my opinion, open source and open science shouldn’t be merged into some compromised definition. It might help to add a definition of open science (the MOF does have one already), but I know the clause cited above has caused concerns about OSI’s draft in our communities.

Thank you for your comment @mdolan. Just clarifying that MOF Class III refers to the Model Openness Framework Class III Open Model. More details available here:

https://arxiv.org/pdf/2403.13784

1 Like

Hi @mdolan and welcome!

Thanks for your perspective, but I’m not sure I understood the parallel you are proposing.
The only difference between copyleft and permissive licenses relates to the licensing of derivative works.

In no way do permissive licenses inhibit the freedom to study the covered software or restrict the freedom to modify such software to the equivalent of tweaking the configuration (fine-tuning).

As for reproducible training, I've already explained before why it's not really an issue, so much so that the TensorFlow team asks developers to file an issue when exceptions occur with determinism enabled.
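
For reference, in recent TensorFlow releases that determinism mode is a two-line opt-in; a minimal sketch follows, noting that the exact API location has moved between versions:

```python
import tensorflow as tf

tf.keras.utils.set_random_seed(42)              # seeds Python, NumPy and TF RNGs
tf.config.experimental.enable_op_determinism()  # ops without a deterministic
                                                # implementation raise an error
                                                # instead of silently running
                                                # non-deterministically
```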

Also, as @lumin pointed out, with an Open Source AI definition that does not mandate training data, you end up with the paradox that Open Source AI could not be used in Science at all (open or not).

License type is orthogonal: open source includes both permissive and copyleft licenses, and open science demands neither, at least for data, per the Model Openness Framework (MOF) (thanks @nick). It talks about OSI-approved licenses and open data licenses being more appropriate for data — collectively, open licenses.

“In the MOF, Class III is the entry point and contains the minimum required components that must be released using open licenses.” It calls for the model architecture (code), [final] parameters (weights), and some documentation, which is barely useful for inference presuming you trust the vendor… and you can work out how to do inference yourself without the inference code! It’s basically the Llama model, and by not requiring the training code let alone data, it doesn’t even meet the low bar of @lumin’s ToxicCandy category, making it decidedly NonFree.

The MOF erroneously claims "Class III contains all the components required to study, modify, redistribute, and build upon a model without restrictions", quickly walking that back: "However, this class lacks completeness and robustness for full reproducibility and the transparency needed to confirm all claims made by the producer. It also lacks sufficient components to evaluate the model, including the training data." Either you do or you don't, so it's hard to believe this isn't errata.

Draft 0.0.9 and its checklist do make it into the ToxicCandy category however, as they demand training code without its data dependency: “Here’s where we dump the database of public Facebook/Instagram posts into the pot, but you can’t have access to those because reasons”. This maps directly to the MOF’s Class II Open Tooling as “an intermediate step between an open model and open science, providing a model consumer with information to test a model producer’s assertions”. It falls far short of protecting the four freedoms.

Class I Open Science adds all the datasets under “any license or unlicensed”. I’m replicating the MOF’s list below because this point is that important. They even repeat it themselves in section 6:

Accepted Open License (Datasets):
  • Preferred: CDLA-Permissive-2.0, CC-BY-4.0
  • Acceptable: Any, including unlicensed

Even the MOF’s highest level of openness and completeness does not demand open licenses for the data. This is the only compromise we can safely make while still protecting the four freedoms, and it seems entirely reasonable especially given the precedent for exceptions (e.g., LGPL). It does give rise to an even higher Class 0 that requires open licenses for data, but that may be a job for the FSF as I think even Debian should consider coming to terms with this fact of life!

The MOF “acknowledges that most pre-training data is subject to copyright and therefore it is not possible to license the data. To this end, datasets are an optional component, with the caveat that datasets must be included for Class I (with any or no license).” This also addresses @quaid’s last post on the reality of data licensing. The key for data is access, “with datasets expected to be readily available without personal requests or paywalls, promoting transparency and enabling scrutiny.”

Fortunately, the MOF is prescriptive as to what is required of us to protect the four freedoms:

To achieve full transparency, reproducibility, and extensibility, we argue that model producers must go beyond just releasing their model and the trained weights and biases, which is currently the norm. Instead, they should include all artifacts of their work, including datasets for training, validation, testing, and benchmarking, as well as detailed documentation, such as research papers, model cards, data cards, and any usage documentation. Completeness also requires all code used to parse and process data, the code used for training and inference, and any code used in benchmark tests, along with any libraries or other code artifacts that were a part of the model development lifecycle.

Note that it includes both testing and training data (among others), per @stefano’s recent thread on the subject: Should testing data be made available to use an AI system?

Here’s the list of requirements from the paper:

  • Class I. Open Science [~= Open Source]
    • Research Paper
    • Datasets (any license or unlicensed)
    • Data Preprocessing Code
    • Model Parameters (intermediate checkpoints)
    • Model Metadata (optional)
    • And all Class II Components
  • Class II. Open Tooling [~= Toxic Candy]
    • Training Code
    • Inference Code
    • Evaluation Code
    • Evaluation Data
    • Supporting Libraries & Tools (optional)
    • And all Class III Components
  • Class III. Open Model [~= Open Weights]
    • Model Architecture
    • Model Parameters (final checkpoint)
    • Technical Report
    • Evaluation Results
    • Model Card
    • Data Card
    • Sample Model Outputs (optional)

@samj There is plenty of code on GitHub, GitLab and elsewhere that has an Apache 2, MIT, 3-Clause BSD, or other OSI-approved license that doesn’t build or is hard to modify without having some other code (e.g. build scripts) - is that source code “open source”? Do I need to be able to replicate the build system too? Are we now talking about GPL-3.0 as a license expectation? I don’t think you’re implying that, but I use it as a parallel. Reproducibility was never part of the OSD.

I think there is a valid argument that MOF Class III won’t meet certain use case requirements, and may not even be useful. The developer ecosystem and marketplace will decide whether it’s useful enough to use/adopt. And those that are lacking needed artifacts will likely fail to achieve any adoption - like millions of open source licensed repos on GitHub. However, I disagree with the notion that a Class III model published with an Apache 2 license is not “open source” licensed. I’m still free to use that intellectual property someone put out there and do whatever I want with it - how is that not “open source”? It may not be the most useful open source, but again, that’s for the developer or user to decide. It does not mean the IP is under a proprietary license. I can’t fathom stating an Apache 2 licensed model like IBM’s Granite model is not “open source”.

My other source is the AI ecosystem itself. Look at how models are licensed on HuggingFace and what they contain. Yes, there's a lot of garbage - but the AI community is making good use of many models in valuable solutions and use cases without everything I've seen discussed in this OSAID debate. It goes against what we're all seeing in the AI community to say many of these Class III models are not openly licensed, usable, improvable (like I'm already doing with RAG) and useful.

Also, as @lumin pointed out, with an Open Source AI definition that does not mandate training data, you end up with the paradox that Open Source AI could not be used in Science at all (open or not).

Yes, that's true. And it will be a limitation for models whose producers want Open Science AI systems to use them - it won't happen. Just like there are limitations on the compatibility between some permissive "open source" and copyleft "open source" licenses that do not allow for certain combinations. But these are restrictions based on the requirements of the user. If I'm building an open science model, I cannot provide aspects of reproducibility if I rely on a foundation model that is MOF Class III. That doesn't prevent me from using that IP or some of it. Maybe I don't use that Class III model itself, but I borrow some of the Apache 2-licensed model architecture code because it was well done. I can do that and build an Open Science AI System using just that Open Source AI System's model architecture code. That is possible - if the model architecture has an open source license. Per my other point to @samj, some of this will be useful or not, but the fact that it has an open source license on it allows me to use/modify/improve it however I want under the terms of that license. I'm also not required to use any of it, so if artifacts for data transparency are missing and I don't like that, I just won't use that model.

IBM’s Granite model is not Open Source.

There, I said it, and you can quote me on that. The problem is that their explanation doesn’t even describe the training data, let alone make it available:

All the models were trained on data that was collected in adherence with IBM’s AI ethics principles and with the IBM legal team’s guidance for trustworthy enterprise use. These Granite Code models are released today under the Apache 2.0 license.

Without the training data — analogous to source code in the AI context, while models are the resulting binaries — you do not enjoy unfettered freedom to use, study, modify, and share the model, so it cannot be classed as Open Source: "The source code must be the preferred form in which a ~~programmer~~ practitioner would modify the ~~program~~ model."

The preferred form for model modification for a practitioner is the training data (plus any instructions and/or code that manipulate/filter/etc. it prior to training, so the system sees the same set as input), and while there are limited modifications one can do without the data (e.g., fine-tuning), you absolutely cannot “do whatever [you] want with it”. As but one example (one’s enough), how will you filter objects containing the word “nazi” from the source with only the weights? Open Source does not and must not limit the modifications you can make. RAG isn’t even modifying the system, rather exploiting it verbatim and tweaking its inputs. Neither is using a gatekeeper to filter “nazi” references in the output. This is an extreme but real example: Microsoft shuts down AI chatbot after it turned into a Nazi.
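
To make that concrete, here is the kind of modification that is trivial when the training corpus and preprocessing pipeline are the form you receive, and that has no equivalent when all you have are the weights. The file names and the filtering rule below are hypothetical, purely for illustration:

```python
import json

def keep(record: dict) -> bool:
    # Hypothetical rule: drop any training record mentioning the term.
    return "nazi" not in record.get("text", "").lower()

# corpus.jsonl / corpus.filtered.jsonl are made-up names: one pass over the
# training data before training, something no amount of weight editing gives you.
with open("corpus.jsonl") as src, open("corpus.filtered.jsonl", "w") as dst:
    for line in src:
        if keep(json.loads(line)):
            dst.write(line)
```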

Rewinding a bit:

No, and I’m yet to take a strong position on whether it’s critical here either. In general, I don’t think we need to expand the definition of Open Source, but we do need to meet it. The MOF authors seem to think it’s useful, arguing the model parameters “format should be compatible with deep learning frameworks, such as TensorFlow, Keras, PyTorch, or the framework-independent ONNX file format”, but used a “should” rather than “must” requirement level. The existing OSD gives us clear guidance here:

The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.

The MOF authors also think IBM erred in releasing it under the Apache 2 license by the way, but this is another implementation issue that can be resolved once we’ve moved the big rocks:

To date, model producers have been releasing model parameters (i.e., weights and biases) using an open-source license, such as Apache 2.0 and MIT, even though model parameters are not compatible with OSS licenses. Since model parameters are in fact data, producers should use an open data license, such as CDLA-Permissive-2.0. Although licenses designed for OSS are permissive and indemnify the developer from liability, open data licenses are better suited to data-specific considerations such as privacy, ethics, and data rights.

The MOF was introduced at the start of townhall 5 and referenced in virtually every one since, including the latest on Friday in which @stefano gave a good explanation of its relevance. It’s been relied upon heavily in the creation of the draft and especially the checklist, yet its own clear position on what is required in terms of openness and completeness seems to have been missed or ignored. Sure, it references “reproducibility”, but I’m not sure you can get to full transparency (to study) and extensibility (to modify) without picking it up on the way… maybe by providing the data but omitting pre-processing code?

To achieve full transparency, reproducibility, and extensibility, we argue that model producers must go beyond just releasing their model and the trained weights and biases, which is currently the norm. Instead, they should include all artifacts of their work, including datasets for training, validation, testing, and benchmarking, as well as detailed documentation, such as research papers, model cards, data cards, and any usage documentation.

This is not entirely unlike the faulty voting data issue in that it’s been used to make and/or justify the decision/s, but it’s like the quote isn’t in the citation given. There may well be differences between MOF Class I and what it means to be “Open Source”, and we need to find and refine them. If this is you, maybe you could ask the LF authors what they meant?

You're correct that IBM Granite models would not be "open source" today, as released, and as per the latest draft of OSAID. But I don't see how they could not be, with a documentation effort that would definitely not be enormous. (In fact, an effort that would be very similar to what the industry is moving towards anyway, in the different but related context of SBOMs.)

IBM could document the full list of source code files they trained their models on, e.g., as a big list of checksums (see the sketch after the list below for what that could look like).

  1. For the training dataset subset that consists of open source code (YAY, happy face here), they can even redistribute them.
  2. For the subset of files that are publicly available, but not open source, possibly not even redistributable by them (grmbl, sad face here), they can point to where they are and provide their checksums for verification purposes.
  3. For the subset of files that are not even publicly available (very sad face here), they can document how they obtained them and their checksums (in case others can obtain the same files via the same or different ways).
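
For illustration, such a checksum manifest could be as simple as the sketch below. The file layout, field names and availability labels are my own assumptions, not anything prescribed by the draft; a real pipeline would record the appropriate case (1)-(3) per file rather than a fixed value.

```python
import csv
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Walk a (hypothetical) local copy of the training corpus and emit one row per
# file: checksum, which of the three cases above applies, and where it came from.
with open("training-data-manifest.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["sha256", "availability", "source"])
    for path in Path("corpus").rglob("*"):
        if path.is_file():
            writer.writerow([sha256(path), "redistributable", str(path)])
```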

As far as I can tell, this would satisfy the current OSAID draft.

Sure enough, it would not be as good as having the entire training dataset (which cannot be made legally available anyway, unless it falls entirely in case (1) above), but it would be better than the current status quo of Granite models.

This is, in fact, a very good example of how OSAID can push the AI market forward in a good direction (even if it doesn’t get all the way to the place I would have personally favored).

Cheers