Most importantly – I have to emphasize this again – please do not allow “open source AIs” to hide their original training dataset. This is not beneficial and not constructive to the whole ecosystem for the long run.
Releasing pre-trained model while nobody else can reach the dataset is tyrannical. The whole ecosystem can only develop downstream application based on it, but never possible to fork the original work to improve the original model itself. This will discourage competition, and greatly encourage monopoly with dataset barriers.
Really, please carefully think about this and avoid making a historical mistake.
If controlled experiments are not possible at all with “open source AI”, it becomes a dead concept for science.