Training data access

It occurs to me that it would be quite problematic if “open access” were determined entirely by the risk tolerance of a distributor of third-party copyrighted material. One entity might release a model along with a training dataset that includes a lot of photos from Flickr, say, perhaps deciding to ignore the potential claims of authors of the photos or the licensing terms applied by those authors. Another entity might use the same training dataset and believe its use of the data in training is fair use, but might refrain from publishing the dataset because of concerns that in that context the dataset would be infringing. Why should the first entity benefit from the perception of having enabled “open access”?