On the current definition of Open Source AI and the state of the data commons

Works created by U.S. federal government employees as part of their official duties are generally in the public domain in the United States, so using them for AI training there poses no copyright problem. However, that status comes from U.S. law (17 U.S.C. § 105) and does not automatically carry over to other countries, where the U.S. government may still hold copyright. So while it is legal to use U.S. government documents for AI training in the U.S., it may not be legal elsewhere.

In my country, Japan, using U.S. government works for AI training would probably be permissible under Article 30-4 of the Copyright Act. However, publishing a dataset composed of U.S. government documents in Japan could raise complex issues involving privacy, moral rights, and other considerations.

This may be a very narrow issue, and in practice it might not pose any real obstacle. Still, if problems like these led someone to keep the dataset non-public, would that be considered “open-washing”?