Training data access

Aspie96 · February 6, 2024, 3:06am

Personally, I quite strongly disagree. I’ll post here what I said in my comment on the latest draft (0.0.5).

I think data being available is really important, but treating it as a requirement for a system to qualify as “open source” would be unwise.

First, the availability of anything under a proprietary license shouldn’t bring a system any closer to “open source” status.
The logical consequence might seem to be requiring all data to be open source, but that would mean plenty of systems which are already commonly referred to as “open source” would no longer qualify.

Furthermore, the training data and the trained model are two separate assets. The model does not necessarily contain much information that is specific to the training data, as the features it learns can be much more general.

Whether the model is open source should therefore be orthogonal to whether the training data is open source, which I think is also consistent with the Open Source Definition (OSD).

Instead of requiring training data to be available for a model to qualify as “open source”, a different, higher standard may be needed to describe a model which is open source AND trained on fully open source data AND well documented.