Draft v.0.0.9 of the Open Source AI Definition is available for comments

Shamar · September 10, 2024, 12:42am

That premise is wrong.

Given all the data and information, bit-for-bit exact reproduction is possible even for deep learning models.
It’s even easy if the computing environment is properly set up.
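
To make the point concrete, here is a minimal sketch, not a claim about any real framework. It is a toy SGD fit in pure standard-library Python, where `train` and `fingerprint` are hypothetical names made up for illustration. Assuming the same data, the same seed and the same computing environment, two runs should produce weights whose bytes hash identically. Real deep learning stacks would additionally need pinned library versions and deterministic kernels, which this toy does not model.

```python
import hashlib
import random
import struct


def train(data, seed, epochs=20, lr=0.01):
    """Fit y ~ w*x + b with plain SGD; every random choice derives from `seed`."""
    rng = random.Random(seed)
    # Random initialisation, fully determined by the seed.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    samples = list(data)
    for _ in range(epochs):
        # The shuffling order is also determined by the seed.
        rng.shuffle(samples)
        for x, y in samples:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b


def fingerprint(weights):
    """SHA-256 over the exact IEEE-754 bytes of the weights (bit-for-bit check)."""
    raw = struct.pack(f"<{len(weights)}d", *weights)
    return hashlib.sha256(raw).hexdigest()


# Toy "training dataset": points on the line y = 3x + 1.
data = [(float(x), 3.0 * x + 1.0) for x in range(10)]

run_a = fingerprint(train(data, seed=42))
run_b = fingerprint(train(data, seed=42))  # same data, same seed
run_c = fingerprint(train(data, seed=43))  # same data, different seed

print("same seed, identical weights:", run_a == run_b)
print("different seed, different weights:", run_a != run_c)
```

The same comparison is what lets a third party check a published model: retrain from the declared data and settings, then compare fingerprints. Without the data, there is nothing to compare against.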

It doesn’t matter: even compiling Debian from scratch is not something everyone can do, yet we distinguish between software that is open source and software that isn’t.

Even the fact that some legal systems might prevent distribution of the training dataset is irrelevant: it will always be possible to create models from data that can be shared, so it’s not an issue for the applicability of the definition.

Furthermore, hardware that is expensive today will be cheaper in a few years, both because of the usual market dynamics and because of antitrust law applied to trusts like the NVidia / Microsoft / OpenAI one.

Finally, the Open Source AI Definition shouldn’t be narrowly optimized for the currently hyped techniques (GPT, anyone?).
This is a field of frequent disruption and we should not pretend to believe that what is difficult to achieve today will be difficult tomorrow. In fact, even today, several AI techniques are trivial to replicate exactly given all the relevant data and information.

If we weaken the OpenSource AI definition so that it allows proprietary software that nobody can really inspect to pose as “open”, we damage every truly open source AI that could and would provide the training data, and we disincentivize such basic transparency even when it is legally possible.

If we weaken the OpenSource AI definition with such a huge loophole, it will be exploited by bad actors, who will extend it to legal exceptions designed to benefit projects that really contribute to the commons by providing all the freedoms of open source with full transparency.

We won’t have any way to verify that the declared datasets were actually used during the creation of the models, that they were used as described, and so on. Users will run open-washed proprietary software they cannot really inspect, study or modify instead of truly open alternatives.

The fact that recreating certain deep learning models is complex and expensive doesn’t mean that it’s impossible (in fact, it’s always possible if no relevant data are kept hidden), and the OpenSource AI definition shouldn’t allow systems that prevent the exercise of such a right/freedom to pose as “open”.