On the current definition of Open Source AI and the state of the data commons

Well, that system would qualify as Open Source AI as long as all the documents pointed to by the URLs stay available and unmodified.

However, to comply with the definition, such a dataset should also include a UTC timestamp of each document retrieval performed to obtain the actual training data, and a cryptographic hash of the content of each downloaded document, since both are needed to verify the dataset's identity and integrity.
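As a minimal sketch of what such a manifest could look like (the field names, the placeholder URL, and the script itself are illustrative assumptions, not part of any existing tool or of the OSAI definition):

```python
# Sketch: build a manifest entry per document, recording the URL,
# the UTC retrieval timestamp, and a SHA-512 digest of the exact
# bytes that would be used for training.
import hashlib
import json
from datetime import datetime, timezone
from urllib.request import urlopen

def manifest_entry(url: str) -> dict:
    with urlopen(url) as response:
        content = response.read()
    return {
        "url": url,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "sha512": hashlib.sha512(content).hexdigest(),
    }

if __name__ == "__main__":
    urls = ["https://example.org/document.txt"]  # placeholder URL list
    print(json.dumps([manifest_entry(u) for u in urls], indent=2))
```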

This way, users of the system could detect that the Open Source AI definition no longer applies before wasting GPU cycles, cooling water and energy.

In any case, the moment the original training dataset becomes unavailable (in whole or in part), any system built from it would no longer match the Open Source AI definition.

Thus I’d argue that relying on such URL-based datasets is a shady approach that leads towards open washing, even though in some rare situations it might be an ethically reasonable one.

Indeed, as long as the data stay available online and anybody can detect corruption, removal or modification with a simple curl | sha512sum, the system could qualify as Open Source AI.
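For completeness, here is a sketch of that check over a whole manifest in the hypothetical format above; it does the same job as the curl | sha512sum one-liner, just for every recorded document:

```python
# Hypothetical integrity check: re-fetch each URL in the manifest and
# compare the SHA-512 of what is served today against the digest that
# was recorded at training time.
import hashlib
import json
import sys
from urllib.request import urlopen

def verify(manifest_path: str) -> bool:
    ok = True
    with open(manifest_path) as f:
        manifest = json.load(f)
    for entry in manifest:
        digest = hashlib.sha512(urlopen(entry["url"]).read()).hexdigest()
        if digest != entry["sha512"]:
            print(f"MISMATCH: {entry['url']}")
            ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if verify(sys.argv[1]) else 1)
```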