Proposal to handle Data Openness in the Open Source AI definition [RFC]

I just don’t understand why we want to stretch the use of the words “Open Source” so much. We don’t do it for software, a la Mongo.

If a software project said it was keeping two of its libraries proprietary, but it documented them really well, there is NO WAY it would be called open source.

If a software company has proprietary source but shares the binary for free, we still applaud them, but we say their model just doesn’t fit with open source.

In the AI/ML world, as compared to software:
The weights correspond to a compiled binary. Most users are interested only in the weights, and the weights are the thing that gets “released” and versioned.
The data AND the code correspond to the source code. Most users will never touch them or even look at them, but with those two pieces I can recreate the weights: I combine/compile them into the weights, as the toy sketch below illustrates.
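To make the “compile” analogy concrete, here is a deliberately tiny, purely hypothetical training loop. The dataset, the one-parameter model, and the learning rate are all invented for illustration and do not come from any real release; the point is only that the published weight is the output of running training code over training data, just as a binary is the output of running a compiler over source.

```python
# Toy stand-in for "compiling" data + code into a weight (illustrative only).

# "The data": a made-up dataset of (x, y) pairs, roughly y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]

# "The code": a one-parameter model y = w * x and its training loop.
w = 0.0                  # the weight that will eventually be "released"
learning_rate = 0.01
for _ in range(1000):
    # gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad

print(f"released weight: {w:.4f}")   # converges to roughly 2.0
```

Hand someone only the printed weight and, like a binary without its source, they have something useful to run but no way to regenerate it: without this exact data and this exact training code, the value cannot be reproduced.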

Thought experiment:
Imagine every single PostgreSQL binary in the world disappeared this instant. I could immediately recompile PostgreSQL and have a working PostgreSQL.

Imagine every single copy of the Llama weights disappeared this instant. There is no way I can rebuild those weights without access to the data and the code.

I have yet to see any data scientist or AI/ML expert explain how they could recreate the model they are distributing without access to the data.

There is no logical way you can abide by Open Source principles without access to the data and the code used to make the model. The fact that vendors compile those into weights and then give the weights away for free is really great. That still doesn’t make it open source.

Call it open weights, free weights, open model… NOT open source.

The degree to which OSI is allowing vendors to influence the definition, and the secrecy of the process, has me really wondering what is going on. The insistence on following this non-free definition for AI has really caused me to lose respect for OSI.
