I don’t have a settled view on whether access to training data (whatever “access” might mean) should be seen as necessary for a model to be “open” (though I’m actually skeptical that it should). But I want to make one point about practicality: If you insist on “open” training data, at least if “open” means approximately what those of us in the FOSS universe understand it to mean now, then as a practical matter there will be few, if any, open source models other than perhaps some toy examples or research models of limited practical interest and utility. That’s because I believe that for the kinds of models attracting current user and policy interest (e.g., LLMs), it is not possible to train a performant model from scratch entirely from data items known to be under libre terms. Am I misinformed about this?
Even if you relax the historical definition of “open” to embrace, say, licenses that prohibit commercial use, that won’t solve the problem at all. And I hope the OSI is not considering departing from the bedrock principle that “open” cannot mean “commercial use prohibited”.
The insistence on “open” training data also ignores the role played by doctrines like fair use in the US. Open source in the traditional sense has always relied in part on the existence of limits to copyright. Fair use might permit much of the training on copyrighted data items happening today, but such data generally cannot be made available under “open” terms. Why shouldn’t open source AI be able to take advantage of fair use and other limits on copyright?