Overarching concerns with Draft v.0.0.8 and suggested modifications

juliaferraioli · May 19, 2024, 5:52pm

It’s not very liberal at all. Without the data, you don’t know what went into training the system. You could not recreate it, retrain it, or otherwise “recompile it”. A trained model, I posit, cannot be considered open source in and of itself without the data, processing code, and training code. The processing and training code even wouldn’t be enough. To simplify things, I find it useful to compare a trained model to a software binary (though I know Spot has issues with that comparison). We don’t call software binaries open source if the code behind it is not available and licensed as open source.

The initial training and evaluation datasets are hugely critical in understanding the underlying system, especially when making decisions about whether to use it as the foundation for your own modifications.

The question I keep coming back to is “could you recreate the model without the data?” Answers to that tend to be a flavor of “yes, well you can acquire your own data” but that undercuts the promise of open source and we won’t see nearly the innovation and empowerment that we have seen with open source software if we make data optional.