Please find the final Release Candidate of the Open Source AI Definition:
The changes are to harmonize the text in data and code requirements to use the term data processing and filtering, to be more in line with general AI/ML practice.
In Data Information
RC2 says: (1) the *complete* description of all data used for training [...] (in response to comments on HackMD by Manset (ITU) and Huang (Samsung))
Removed footnote 4, “The Open Source AI Definition does not take any stance as to whether model parameters require a license…”, as it only created confusion. Updated the FAQ instead.
Fixed a typo.
Comments are welcome here on the forum and on HackMD:
As usual, I translated this definition into Japanese and looked carefully for any issues, but at this point I have no comments. Yes, this is likely to be the first version where I won’t leave a comment on HackMD.
Using the term “complete description” was a brilliant idea. I believe many people will agree with the explanation that “detailed information” must include a “complete description.”
I fully understand that there are still some who may not be satisfied with this change, but I believe this is the most realistic option at the moment.
The Preamble should mention the limited intent of the definition.
There are already lively communities around machine learning that are equivalent to the open source communities around code. Individuals and organizations share machine learning artifacts and data sets. Some businesses are built on top of that. This definition says that they are not doing open source. I don’t know if that will cause any grief. But it is possible that these communities will coin their own term, meaning that “open source” is lost forever as an umbrella term. It may even cause a lasting split.
Another problem is that the European AI Act explicitly mentions “open source” without defining it. The AI Act makes very complex demands which can only be navigated by professional legal departments. But there are a few exceptions that make open source AI a bit less impossible for the little guy. If this definition convinces judges that open source means extensive documentation, then this tiny chance will be undone.
Obviously, this definition is meant for research purposes and is not meant to apply generally like the software definition. That should be made clear.
A more technical point: the definition assumes that training data is used. Are projects that do not use training data also out of scope?
There should be a warning regarding the legal uncertainties.
At present, publishing details about training data means volunteering to litigate a precedent. Maybe this is different in Japan, where the TDM exception is well thought out. But it is the case in the EU and the USA. Having to pay damages or fines is a very real possibility, especially in Europe.
Maybe the intended audience knows this, but students or hobbyists can’t be expected to know what they are getting themselves into if they try to follow the definition.