If we properly equate the source dataset and the entire procedure that creates the model (data processing, filtering, tokenization, training, and so on) with the code, including the values drawn from random sources throughout the process, so that the exact executables (the weights) can be recompiled from the sources, the definition becomes crystal clear:
Data information: Sufficiently detailed information and all the data used to train the system (including any random value used during the process), so that a skilled person can recreate an exact copy of the system using the same data. Data information shall be made available with licenses that comply with the Open Source Definition.
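To make the "recompile the weights from the sources" idea concrete, here is a minimal sketch (assuming PyTorch on CPU; the tiny linear model, the synthetic data and the seed are hypothetical stand-ins for a real training pipeline): once every random source is pinned and recorded, two independent runs of the same "build" produce bit-identical weights.

```python
# Minimal sketch: if every random value used during the process is pinned,
# the "compilation" from data + code to weights is reproducible bit for bit.
import torch

def train(seed: int = 42) -> torch.Tensor:
    torch.manual_seed(seed)                   # record/pin the RNG seed: the
                                              # "random values used during the process"
    torch.use_deterministic_algorithms(True)  # refuse non-deterministic kernels

    data = torch.randn(64, 8)                 # stand-in for the released dataset
    target = torch.randn(64, 1)
    model = torch.nn.Linear(8, 1)             # weight init also consumes the seeded RNG
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(100):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(data), target)
        loss.backward()
        opt.step()
    return model.weight.detach().clone()

# Two builds from the same sources yield an exact copy of the system:
assert torch.equal(train(), train())
```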
As for the objection, "if you don't include questionable data, you don't get a good model":
First, this is not true in general: most hyped models are indeed built on top of questionable data, but several AI systems are built from data that are not human-related at all (such as weather or pollution measurements, or data from engineering fields) or not questionable in the first place (such as aggregated traffic data over streets).
Moreover, that's completely orthogonal to the OpenSource-ness of an artifact: for example, there are plenty of low-quality or bug-ridden (and even backdoored) JS packages on NPM (do you remember colors.js?) that were open source anyway.
On the other hand, a proper OpenSource AI definition that really grants the right to study and modify the system might lead to better open and unquestionable datasets in those fields where they are currently lacking.