Open Source to require documentation for the first time for AI?

Open Source, including as specified in the Open Source Definition (OSD), has never required documentation.

It limits itself to the following 3 requirements:

  1. The source code must be the preferred form in which a programmer would modify the program.
  2. Deliberately obfuscated source code is not allowed.
  3. Intermediate forms such as the output of a preprocessor or translator are not allowed.

Documentation (whether inline or external to the code) may be “appreciated” and is often included, but never required. Even the LICENSE file itself is not required, as a reference is adequate.

A company can release insanely complicated firmware code for a quantum computer, stripping out all comments, and it would still be valid Open Source software that protects the four freedoms in that it can be used, studied, modified, and shared, provided it were released in the “preferred form”. Even if it were written in assembly, that would be the preferred form, but not if that were merely the intermediate form of higher-level source code, or had been otherwise obfuscated.

Before we enter this brave new world, we should make a deliberate decision that this is an acceptable new demand.

I foresee it proving problematic for several reasons, but primarily because it’s impossible to specify. We know that source code satisfies the openness requirement from its license, and completeness simply from its ability to produce the software — either it does, or it doesn’t (functionally, whether or not the hashes match, in that reproducibility has never been a requirement either). If you require a bad actor to deliver documentation, they can exercise malicious compliance and add a single line to check your box:

# f@%k you

This problem manifests itself in weasel wording in the release candidate such as “sufficiently detailed information”, “skilled person”, and “substantially equivalent system”. These are terms you might see in a legal contract, the critical difference being that the code of the contract is ultimately parsed by a judge should an exception be raised (programming and law are remarkably similar in that sense).

To the extent there is a judge, it is the OSI. Unlike OSI Approved licenses, which are reviewed once and applied many times, the OSAID in its current form needs to be directly and manually applied every single time. Even the review-once model spawned a project that produced a report on the license proliferation “problem”. Does the OSI intend to adjudicate application of the OSAID, and if so, will it publish its decisions as it does licenses, knowing it could be sued by vendors, whether right or wrong, like its founder was? If not, who will?

As I see it, in addition to being unacceptable due to the new documentation requirement, which significantly expands the scope of Open Source, the OSAID is unimplementable and should be rejected by the board as such (at a minimum the board should insist on a reasonable testing period). It is also unenforceable, in that anybody can claim to be “Open Source AI” and nobody will be able to do a thing about it, because the test is subjective rather than objective: a matter of opinions. With Open Source we simply check that the license is on the list and confirm that the source produces the software. It may be that a critical intermediary is also missing in the form of specifications like MOF Class I, which would be analogous to OSI Approved licenses.
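To make the contrast concrete, here is a minimal sketch of that objective test (illustrative only: `OSI_APPROVED` stands in for the real license list, and the build command is whatever produces the software from source):

```python
# Hypothetical sketch: the Open Source test is objective on both axes.
import subprocess

OSI_APPROVED = {"MIT", "Apache-2.0", "GPL-3.0-only"}  # illustrative subset

def is_open_source(license_id: str, build_cmd: list[str]) -> bool:
    if license_id not in OSI_APPROVED:   # openness: on the list, or not
        return False
    result = subprocess.run(build_cmd)   # completeness: builds, or not
    return result.returncode == 0
```

There is no equivalent check for “sufficiently detailed information” or “skilled person”; those tests only terminate when a judge rules.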

I’ll set aside the “Documentation Requirement” for now, as it seems to be part of the ongoing discussion about data information. However, I do agree with the claim that the current OSAID is both “unimplementable” and “unenforceable”. While I acknowledge that the wording of the definition has been gradually evolving in a positive direction, when OSAID 1.0 is released at the end of the month, what exactly will OSI and the volunteers supporting OSI be expected to do?

It’s clear that reviewing each AI system one by one will be too much to ask of the existing volunteer effort, and as Samj-san pointed out, I don’t believe that these decisions can be made objectively. If OSI continues to only review licenses (legal conditions) as before, there could be significant opportunities for providers of AI systems to engage in misrepresentation. Additionally, relying solely on a process that has historically focused on copyright and license reviews could place a heavy burden on the existing review process.

I assume that Stefan-san and the board are aware of these issues, but at the very least, the general OSI membership, including myself, have not yet been informed about any plans. The first priority should be releasing the checklist aligned with RC1’s contents. However, even if a valid OSAID 1.0 and checklist are completed, simply publishing them won’t give those in the Open AI/ML industry a clear idea of how to implement or apply them. This is what concerns me. While it is not my place to decide if OSAID 1.0 should be released on schedule, I remain unsure if this is the right course of action given the current state of affairs. I worry about divisions within the community.

That said, part of me also believes we should try releasing it.

When the “Open Source Definition” was created, it was derived from the lessons learned during the early days of the Debian Project about what constituted free software and what the conditions for freedom were. In other words, there were already various established practices within the community, and the OSD was a collection of those experiences.

In contrast, the concept of “Free and Open” within the AI world hasn’t had nearly the same accumulation of experience as we had in 1998. Furthermore, related legal systems, interpretations of copyright law, and laws around publicity and privacy vary widely between countries. In such a context, if we don’t release the definition soon, there may be no industry-wide push to adhere to OSAID’s four freedoms, nor the motivation to work towards a complete definition within the AI/ML sector.

Open Source Group Japan, the guardian of the term and trademark “Open Source” in Japan, and I support OSI’s ambitious challenge to define “Open Source AI”. However, even if OSAID 1.0 is released, we will not immediately accept it in full. Instead, we will continue to question whether it is a valid definition of a free and open AI system for Japan’s AI/ML community. It may take years before we decide whether this definition should be accepted, and there is a chance that OSAID could eventually be deemed ineffective. Nonetheless, if by that time we have managed to prevent open-washing, I would consider it a success.

Hello @samj and @shujisado ,

Although not required initially, what I’ve observed is that poorly described and documented projects tend to be abandoned unless a strong volunteer effort (or enterprise sponsorship) steps in to correct this.

CHAOSS metrics have established two measures:

Several OS governance projects also address this:

So, although not a requirement of Open Source on its own, documentation has become a staple of well-managed projects.

In the particular case of AI systems, without a good description of why a particular model and dataset were chosen, you are left with very little reason to adopt it.


As for the other matters, I’ll refer back to another discussion: The ZOOOM approach on Asset Openess - #3 by gvlx

Yes, I agree, but that has never been the purpose of the OSD or the OSAID.

It could be part of some research, as we have already seen with:

A. Liesenfeld and M. Dingemanse, ‘Rethinking open source generative AI: open washing and the EU AI Act’, in The 2024 ACM Conference on Fairness, Accountability, and Transparency, Rio de Janeiro Brazil: ACM, Jun. 2024, pp. 1774–1787. doi: 10.1145/3630106.3659005.

Or the work of some specific organizations:

Presently, a project claiming to be Open Source either:

  • uses an OSI-approved license
  • creates its own license, submits it to the OSI, and it is approved
  • does none of the above and its claims are contested by the community

I do not believe AI systems should behave differently.

These are very valid concerns, and they seem to me to be the main motivation for the OSI Board’s push for a final version in October.

However, I would prefer we could reach some “Rough consensus”, which will require new neutral moderators on this forum.

Given all that has been said, there should be an effort to bring stakeholders together (communities as well as enterprises) to discuss the details of the checklist and the FAQ during open sessions.

Agreed, but that doesn’t make it any less Open Source.

Agreed. This is why the IETF process allows for appeal, and with @stefano being at the top of the tree that means the board and its imminent vote. I trust they’re tracking these developments given the gravity of their decision and the availability of alternatives (e.g., allow open access to datasets, demand further testing internally, insist on a long beta period, require rough consensus, or delay the launch pending developments like truly-open-except-for-the-encoder Molmo).

Any finding of rough consensus needs, at some level, to provide a reasoned explanation to the person(s) raising the issue of why their concern is not going to be accommodated. A good outcome is for the objector to understand the decision taken and accept the outcome, even though their particular issue is not being accommodated in the final product. Remember, if the objector feels that the issue is so essential that it must be attended to, they always have the option to file an appeal. A technical error is always a valid basis for an appeal, and a chair or AD has the freedom and the responsibility to say, “The group did not take this technical issue into proper account.” Simply having a large majority of people agreeing to dismiss an objection is not enough to claim there is rough consensus; the group must have honestly considered the objection and evaluated that other issues weighed sufficiently against it. Failure to do that reasoning and evaluating means that there is no true consensus.

Yes, I could probably point to some projects which, despite having no or lousy documentation, have a good number of followers (mostly very technical projects where everyone involved has deep knowledge of the matter)…

And they remain firmly on Open Source ground.


I think @samj makes an excellent observation in OP: that documentation is turning out to be one of the key asks of the new OSAID, and that this is a deviation from prior “Open” definition efforts.

Importantly, one reason open source software has never needed to ask for it is because once all aspects of a system are truly open, the system is in a sense self-documenting. I don’t need to separately list all dependencies of an OSS software package; they are part of the code that can be inspected. (Of course it’s nice to also document them, and given some utility and a critical mass of users this will happen anyway, precisely because it is all open in the first place.)
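To make that concrete, here is a toy sketch (plain standard-library Python, nothing OSAID-specific; the filename is a placeholder): because the source is open, the dependency list can be recovered from the code itself rather than from separate documentation.

```python
# Toy sketch: recover a package's imports directly from its open source.
import ast

def list_imports(path: str) -> set[str]:
    tree = ast.parse(open(path, encoding="utf-8").read())
    deps: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            deps.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            deps.add(node.module)
    return deps

print(sorted(list_imports("some_module.py")))  # placeholder filename
```

Nothing of the sort is possible for a model whose training inputs are kept under wraps.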

If, on the other hand, parts of a system have been kept under wraps for whatever reason, and then the model provider is asked to provide “documentation”, they are going to do the absolute minimum required because to do more is only to increase the risk surface for legal, ethical, and scientific scrutiny.

So I am increasingly of the opinion that asking model providers to merely provide some data information (a form of documentation) is not going to suffice. I think RC1 is a step in the right direction, but I am mindful of Sam’s point that water is always going to find the lowest level. I do wonder about the room left in the implementation details to just supply some judiciously selected data information and call it a day.


While publicly funded research like ours can accomplish something (e.g. build awareness of open-washing), we definitely don’t have the resources or the inclination to turn any of this into a certification exercise. Our job was in a way the “easy” one: to spot loopholes and identify pressure points for corporate lobbying. I don’t envy anyone trying to come up with a definition and in fact I think the original sin of the EU AI Act was to tie the notion of open source to a license in the first place (as we also write in the paper):

But anyway, that is a bit off-topic since I realize we are here, at the OSI forums, discussing the open source AI definition.

Perhaps, but it has always been a request from practitioners, and it could be said to be implicit in the “use” of an application.

But it is important to keep in mind.

Thanks to all who’ve chimed in here, and to @samj for the OP.

Documentation, while always welcome, is necessarily separate from the code (even if it is derived from comments in the code). You could read a document and expect a particular behaviour, only for something else to happen, leaving us wondering what’s going on.

Over the last few weeks, after reading all that I could in the various fora on this topic, I am convinced that the training data MUST be made available and be part of the OSAI definition.

We are heading to a deadline for the release of v1 but don’t let that stop us from doing what is needed.

I’d like to look at scenarios:
a) OSAID is released without including training data as a prerequisite. There will be elation in the tech space, and because of this, even greater open washing will happen. This is NOT a desired scenario.
b) OSAID is released with the training data needing to be released under Creative Commons or some other form of open license. There will be hue and cry from many, who might characterise OSI as being behind the times, or claim that OSI has no idea how data should be made available. All of this is something we will have to anticipate while holding on to the principle of the Four Freedoms (as @samj has noted many times).

I don’t think there is a third scenario that is worthy of enumerating.

I note also @quaid’s models and matrix ideas. I do like them, but they add a degree of cognitive overload that, imho, would not do justice to what we are trying to achieve.

We have to do this right and do this well.

We should follow the KISS principle so that there is clear and wide adoption.


Strictly speaking, neither does the Open Source AI Definition. You’re misreading or misrepresenting the requirements as you overflow this forum with messages, some even autogenerated.

The Open Source AI Definition asks for the “preferred form of making modifications to an AI system.” That turned out to be a list of required components. Some of those components are documentation, technical reports, and metadata (like model and data cards, etc.). Other required components are code. Finally, there is a requirement for a new thing that we don’t require for software either: the weights and biases (parameters). That’s it. BTW, this is the same approach reached by the MOF and Meeker’s Open Weights Definition.

Sorry @stefano but I’m a bit confused.

The RC1 states:

  • Data Information: Sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system. Data Information shall be made available under OSI-approved terms.

How would you convey such information without documentation?

Or maybe you mean that Data Information is not even required, “strictly speaking”?

I didn’t realise there was a prohibition on constrained, disclosed use of AI in a discussion about regulating same, but if the proposed definition that took three years to develop can be deconstructed by a random word generator in seconds, then perhaps that’s not where the problem is? By publishing both prompts and completions I aim to eliminate perceived bias and demonstrate that my/our arguments are sound.

The Open Source AI Definition asks for the “preferred form of making modifications to an AI system.” That turned out to be a list of required components. Some of those components are documentation, technical reports, metadata (like model and data cards etc.)

That is the draft definition’s hypothesis, but evidence (like unresolved objections “overflowing this forum”, and my straw poll with 87% of practitioners demanding datasets, since we’re doing democracy) suggests that’s not at all the case. We will certainly see in due course, but hopefully without doing irreparable harm to our shared cause.

Therefore, I’ve asked Google’s Gemini Pro via their NotebookLM (because it uses and cites specific sources) to:

Create a new MOF class that focuses on the bare minimum for “Open Source AI,” protecting the four freedoms at the system level by creating a table categorising each of the MOF components into one of three buckets:

  • Open License
  • Open Access (like datasets in MOF Class I which are to be available under any license or unlicensed)
  • Not Required

Don’t introduce new requirements not typically required of open source (e.g., documentation), except as needed to protect the four freedoms at the system level (that is, guaranteeing the freedom to use, study, and modify the system).

Don’t propose a solution that either:

  • doesn’t fully protect the four essential freedoms at an AI model/system level
  • unnecessarily protects additional freedoms such as reproducibility or transparency (except insofar as required for protecting the four essential freedoms).

Here’s what the cloud brain came up with, documentation being conspicuous in its absence (apologies for its hyphenated “open-source”):

A Minimal Open Source AI MOF Class

The sources you provided don’t seem to directly address the concept of a “bare minimum” MOF class focused solely on the four essential freedoms of open-source AI. The existing MOF classes aim for varying degrees of completeness and transparency, which inherently encompass those freedoms but go beyond the bare minimum requirement.

However, based on the existing MOF structure and your requirements, we can extrapolate a potential new class:

| MOF Component | Open License | Open Access | Not Required |
| --- | :---: | :---: | :---: |
| Datasets | | ✓ | |
| Data Preprocessing Code | ✓ | | |
| Model Architecture | ✓ | | |
| Model Parameters | ✓ | | |
| Model Metadata | | | ✓ |
| Training, Validation, and Testing Code | ✓* | | |
| Inference Code | ✓ | | |
| Evaluation Results | | | ✓ |
| Evaluation Code | ✓* | | |
| Evaluation Data | | ✓* | |
| Supporting Libraries and Tools | ✓* | | |
| Model Card | | | ✓ |
| Data Card | | | ✓ |
| Research Paper | | | ✓ |
| Technical Report | | | ✓ |
| Sample Model Outputs | | | ✓ |
| Model Openness Configuration File | ✓ | | |

Justification:

  • Open License Requirement: This new class mandates open licenses for components crucial for using, studying, and modifying the AI system. This includes the Model Architecture, Model Parameters, Data Preprocessing Code, Inference Code, and Model Openness Configuration File. 1 2
  • Open Access for Datasets: Following the MOF Class I approach, datasets should be accessible under any license or even unlicensed to guarantee the freedom to use the AI system. 3 4
  • Not Required Components: All other components related to documentation, evaluation, or additional tooling are deemed non-essential for protecting the four essential freedoms and are thus marked as “Not Required.”

Important Considerations:

  • This proposed class prioritizes the four essential freedoms at the expense of reproducibility and transparency beyond the bare minimum.

  • The sources emphasize the importance of comprehensive documentation and open data for a truly open and trustworthy AI ecosystem. 5 6 7 8 9 This minimal approach might not be ideal for fostering trust and collaboration within the AI community.

This response builds upon the structure and principles outlined in the sources but proposes a new MOF class not explicitly defined within them.

I then asked it to “justify the exclusion of Training, Validation, and Testing Code” and it conceded the freedoms to study and modify were only “partially” protected, promoting those I’ve marked with * from “Not Required” (which is to say they’re the ones that may be optional, if anything is).
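For concreteness, here is the same classification expressed as a lookup table (a sketch only; the placement of the starred components reflects my reading of the output above):

```python
# Sketch: the proposed minimal "Open Source AI" MOF class as data.
OPEN_LICENSE = {
    "Model Architecture", "Model Parameters", "Data Preprocessing Code",
    "Inference Code", "Model Openness Configuration File",
    "Training, Validation, and Testing Code",  # *
    "Evaluation Code",                         # *
    "Supporting Libraries and Tools",          # *
}
OPEN_ACCESS = {"Datasets", "Evaluation Data"}  # Evaluation Data is a *

def bucket(component: str) -> str:
    """Everything not listed above (cards, reports, sample outputs,
    metadata, evaluation results) is Not Required."""
    if component in OPEN_LICENSE:
        return "Open License"
    if component in OPEN_ACCESS:
        return "Open Access"
    return "Not Required"
```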

Why is this not a workable change for RC2, especially given the myriad (million!) models on Hugging Face that would become potential candidates, and for which the four freedoms would be fully protected? Conversely, what specific models would be excluded by this change that you would have hoped to include?

I’m not sure if using what clearly is proprietary AI is the best way to build a case for Open Source AI. In any case, let’s try to keep the forum free from AI generated content.

Just a note that it has been 3 years since the OSI started the journey to understand what’s Open Source AI. We have had conversations with experts and the community since 2022:


I’ve noted that it was three years rather than one, but you’ve not addressed the question: what would happen if we were to introduce this change for RC2 (other than it being close to the arbitrary, self-imposed deadline, which can always be delayed)?

Nick-san, this is a bit off-topic, but I’d like to confirm something.

I believe that the OSI’s research into the relationship between Open Source and AI began with the Deep Dive program, announced in the May 2022 blog post titled “Exploring the Future of Open Source.” However, at that time, the plan to define OSAID was not yet finalized, and I understand that the plan was officially announced in the June 2023 blog post “Now is the Time to Define Open Source AI.” Is this understanding correct?

The first publicly released draft was in October 2023, and I began commenting on the HACKMD site around that time. This forum was established in January of this year.

At least for me, I had never even considered the idea of defining Open Source AI until June 2023 (as I thought the OSD was sufficient), and I remember being quite surprised by Stefan-san’s blog post. I suspect many in the community felt the same way. When the timeline of “2 or 3 years” is emphasized, I worry that many experts and members of the community who were not involved in the discussions during the non-public period may feel a sense of exclusion.

Whoa, wait, what @nick? You mean to say the interview with @Lumin I’ve already cited was part of the OSI’s “deep dive”, and his input was ignored? Was he not clear enough? I can’t even…

[00:07:XX] SF: […] The neural network that has been trained to detect cats and dogs, now, if we wanted to distribute that piece of software inside Debian, or inside one of the few free software, mobile open-source systems to help retrieve our pictures, what do we need?

[00:08:20] MZ: Actually, we need lots of things, especially if we are doing distribution of free software. If we create an artificial intelligence application, we will need data. We will need the code for training the neural network. We will need the inference code for actually running the neural network on your device. Without any of them, the application is not integral. None of them can be missing.

[00:08:52] SF: The definitions that we have right now for what is complete and corresponding source code: how can they be applied to an AI system, to an application like this that detects pictures of dogs?

[00:09:04] MZ: Well, actually, the neural network is a very simple structure, if we don’t care about its internals. You can just think of it as a matrix multiplication. Your input is an image, and we just do lots of matrix multiplications, and it will give you an output vector. This is simply what happens in the software. Both the training code and the inference code are doing a similar thing.

Apart from the code, the data is something that can change. For example, we can use the same training and inference code for different data sets. For example, I released code for a cat and dog classification problem, but you can take the code and say, “Oh, I’m more interested in classifying flowers.” Then you can collect new data sets about different kinds of flowers and use the same code to train the neural network and do the classification by yourself.

If you want to provide a neural network that performs consistently everywhere, you also have to release the pre-trained neural network. If you are releasing free software, that also requires you to release the training data, because free software requires the freedoms that allow you to study, to modify, or to reproduce the work. Without any training data, it is not possible to reproduce the neural network that you have downloaded. That’s a very big issue.

Nowadays, in the research community, people are basically using neural networks that are trained on non-free data set. All of the existing models are somewhat problematic in terms of license.


Hi @Mark, you make a valid point about the license, and not just the definition, being a crucial point in the AI Act framework. It seems to me that the licensing issue has been ignored by the EU, which for now sticks to the “free and open license” language in the AI Act without investigating what this means in the AI space. Which free and open licenses?

But also, the OSAID process has not really addressed the issue of licensing (although I remember it being discussed in the Deep Dive webinars, and that was very useful).

I think that there’s a bigger conversation to be had about licensing in the open source / open weights / open-ish AI space, also with stewards of content licenses.