Sorry, but I can’t follow your argument.
If the data sources are available and legally usable to train the LLM, a simple link with versioning and a sha512sum would be enough.
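To make that concrete, here is a minimal sketch of how such a check might look, assuming the training data is described by a small manifest. The manifest fields (`url`, `version`, `path`, `sha512`) and the function names are hypothetical, chosen only for illustration; nothing here comes from an existing tool or from the definition itself.

```python
import hashlib
from pathlib import Path


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest, base_dir="."):
    """Check each already-downloaded data source against its published checksum.

    `manifest` is a list of dicts with the hypothetical keys
    'url', 'version', 'path' and 'sha512'. Returns {path: bool}.
    """
    results = {}
    for entry in manifest:
        local = Path(base_dir) / entry["path"]
        # A missing file counts as a failed check rather than raising.
        results[entry["path"]] = (
            local.is_file()
            and sha512_of_file(local) == entry["sha512"].lower()
        )
    return results
```

Anyone who fetches the listed versions from the listed links can run this and confirm they hold byte-for-byte the same data that was used for training.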
We are in exactly the same scenario you proposed above, and the same solution applies.
The point remains: it’s perfectly possible to create an LLM that complies with an Open Source AI definition requiring the availability of the data used to train the model’s weights, “so that a skilled person can recreate an exact copy of the system using the same data”.
Well, it’s not the longest thread we have joined so far, but it has been a deep exchange on the supposed limits of a proper Open Source AI definition, and it showed that such a definition applies even to interesting corner cases.
Anyway, thanks for the conversation, and if you have further doubts or suggestions, I’d be happy to discuss them!