A new draft of the Open Source AI Definition: v.0.0.6 is available for comments

Draft v.0.0.6 is out, introducing significant changes in the section “What is open source AI”:

  • added the sentence “Precondition to exercise these freedoms is to have access to the preferred form to make modifications to the system,” with an example of what this looks like for a machine learning system
  • the checklist to evaluate legal documents now separates the required components from the optional ones, reflecting the results of the working groups
  • minor change in wording from “license” to “legal document”

The preamble is left untouched; more discussion seems to be needed.

Comment on the full draft:


At the risk of reviving the training data access thread, I’m very confused by what seems like an inconsistency in this document. The What is Open Source AI? section includes the following as a precondition to exercise the freedoms that an Open Source AI provides:

But the Checklist to evaluate legal documents lists model data in the optional section.

I’m not going to argue for one or the other, in large part because I haven’t decided where I stand on the question myself, but it seems to me that either

  1. The model data (or at least some of the sub-items underneath) can’t be optional because it’s a precondition to Open Source AI -or-
  2. The model data shouldn’t be listed as a precondition.

Or perhaps there’s some nuance that I’m missing here. But if it’s not clear to me, I expect there will be others who find it unclear as well.

Hi @ben-cotton,
I think this is not an inconsistency - the model parameters are not being classified as a “dataset” even though in a sense they are a kind of generated dataset. Or have I misunderstood your comment?

I’m not sure I follow… where do you see the words model data in the draft?

I don’t. That was sloppy wording on my part. But the broader point is that the inclusion of “data” as a precondition for an Open Source AI is inconsistent with the listing of “data” as an optional element in the following section.

I’m trying to reconcile “training data” being listed as both a precondition and an optional element. Perhaps the intent in the precondition is “you must be told what training data were used (e.g. ‘This model was trained on Ben’s Cool Dataset’)” and the intent in the table is “it’s optional to provide the actual data from Ben’s Cool Dataset.”

Since the preconditions include three elements (data, code, and model), I would expect the checklist to reflect that. Instead the checklist has only data and code in the required section.

Maybe my confusion is a matter of presentation and not substance?


Oh, yes, now I think I understand the source of your concern and you have a solid point, thanks for the careful review.

You’re right: the checklist needs to have a line about the data transparency requirements.

I’ll go update the evaluation checklist for the next phase. Thank you!

  • Code : The code used for pre-processing data, the code used for training, validation and testing, the supporting libraries like tokenizers and hyperparameters search code (if used), the inference code, and the model architecture.

The “model architecture” is really a prerequisite of the “training, validation, and testing” code. Without the architecture defined, the training, validation, and testing code cannot correctly load a model at all.

Thus, I suggest at least re-ordering the phrasing as follows:

  • Code : The code used for pre-processing data, the model architecture, the code used for training, validation and testing, the inference code, as well as the supporting libraries like tokenizers and hyperparameter search code (if used).

This makes the items logically organized.
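The dependency described above can be illustrated with a minimal sketch (all names here are hypothetical, not from the draft): checkpoint-loading code can only succeed once the architecture has declared which parameters exist.

```python
# Minimal sketch of why the model architecture must come before the
# training/validation/testing code: loading a checkpoint requires knowing
# which parameters the architecture declares. All names are hypothetical.

def build_architecture(hidden_size):
    # The architecture defines which parameters exist and their shapes.
    return {"w1": [[0.0] * hidden_size], "b1": [0.0] * hidden_size}

def load_checkpoint(architecture, checkpoint):
    # Loading fails unless every saved tensor matches a parameter that
    # the architecture declares -- the dependency described above.
    unknown = set(checkpoint) - set(architecture)
    if unknown:
        raise KeyError(f"checkpoint keys not in architecture: {unknown}")
    architecture.update(checkpoint)
    return architecture

model = build_architecture(hidden_size=4)
model = load_checkpoint(model, {"w1": [[1.0, 2.0, 3.0, 4.0]]})
```

Real frameworks behave the same way: for example, PyTorch’s `load_state_dict` raises an error when the saved keys do not match the instantiated architecture.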

  • Model : The model parameters, including weights. Where applicable, these should include checkpoints from key intermediate stages of training as well as the final optimizer state.

The “model parameters” and the “model weights” are exactly the same thing. They are what a “pre-trained” model consists of: values learned automatically by the training algorithm.

For instance, the “model parameters” can be the weights of the pre-trained Pythia 7B LLM, dumped in the Hugging Face safetensors or PyTorch state-dictionary format.

Do not confuse these with “hyper-parameters”. Those are parameters that control model behavior during training or inference, but they are tuned manually by a human, or by a “meta” algorithm (such as grid search) independent of the training algorithm itself.

For instance, a “hyper-parameter” can be the sampling temperature of an LLM during inference.

Most importantly – I have to emphasize this again – please do not allow “open source AIs” to hide their original training datasets. This is neither beneficial nor constructive for the whole ecosystem in the long run.

Releasing a pre-trained model while nobody else can access the dataset is tyrannical. The whole ecosystem can only develop downstream applications on top of it; it is never possible to fork the original work and improve the model itself. This discourages competition and greatly encourages monopoly through dataset barriers.

Really, please carefully think about this and avoid making a historical mistake.

If controlled experiments are not possible at all with “open source AI”, it becomes a dead concept for science.


Related to this, it’s worth noting that in 2 out of 4 working groups, a majority of the group participants concluded that access to the training dataset was required to exercise the freedom to study the AI system: 5/9 participants for the Bloom working group and 4/5 for Pythia. (For OpenCV it was close, but below 50%: 4/9. I only compared with the number of WG participants, not with the number of voters, which might be lower, possibly pushing OpenCV above the 50% bar.)

For what it is worth: I’m not convinced that voting is the right way to decide on this matter in the first place. So I don’t think this is necessarily a flaw in the process. But if we assume that WG votes have at least some value, these results show that this matter is still a divisive topic among WG participants, rather than a consensual outcome in favor of making training data access optional in the definition.


Indeed, it’s a known fact that the debate around data is divisive. I also think that the board wouldn’t be particularly interested in approving a definition that is purely aspirational, one that a null set of meaningful AI systems would satisfy.

We also need to move on with the process; we can’t continue debating the same thing without adding new elements. Let’s get to the next phase, check what happens with the draft the way it is (the whole training dataset not required), and record what objections there are to data transparency requirements as the next best thing after public access to the full dataset.

I had been concerned about the possibility of OSAID becoming a completely meaningless endeavor, but draft 0.0.6 is the first time I have seen the possibility of it becoming a meaningful one.

I recognize that there are still many difficult issues to be addressed, but I wanted to say this. Let’s all go forward.


Thanks for that affirming comment, @shujisado. We’re trying really hard to make this definition just, cohesive, feasible, and excellent, and encouragement is always appreciated. :slightly_smiling_face:


My main concern with training data is to be clear between:

  1. A good description of the data used, and
  2. An actual downloadable archive of the data used.

Descriptions might make sense when the data is all public web data - one could imagine a long list of URLs.
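To make the distinction concrete, here is a minimal sketch (my own hypothetical example, not text from the draft) of what such a “description” could look like in practice: a manifest of public URLs with checksums, published instead of the data archive itself, so others can re-fetch and verify the data.

```python
# Hypothetical sketch of a minimal "data description" manifest: a list of
# public URLs plus content digests, as opposed to shipping the archive.
import hashlib
import json

def describe(records):
    # Each record is (url, raw_bytes). We publish the URL and a SHA-256
    # digest rather than the bytes themselves, so third parties can
    # re-download each item and verify it matches what was trained on.
    return [
        {
            "url": url,
            "sha256": hashlib.sha256(data).hexdigest(),
            "bytes": len(data),
        }
        for url, data in records
    ]

manifest = describe([("https://example.org/page.html", b"hello world")])
print(json.dumps(manifest, indent=2))
```

A manifest like this is far smaller than a terabyte-scale archive, though of course it only works while the listed URLs remain available.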

A full archive of many terabytes may not always be practical to produce, let alone keep online forever, but I am certainly open to the notion of making it compulsory.

Either way, let’s be specific in the AI definition.


Thanks for the reminder. Yes, we want to have clarity and we’d like your help.

In the current draft, there should be no doubt that this is not a requirement (what’s required instead is a good description of the data used).

Can you make an effort to point at what part of the text of the draft makes you think that a downloadable archive of the data used [to train the system] is a requirement?

As you’re looking at the text, please also make an effort to suggest alternative text that is more clear.


A post was split to a new topic: The data requirement: “Sufficiently detailed information” for what?