Draft v.0.0.9 of the Open Source AI Definition is available for comments

Under Japan’s copyright law, the reproduction of copyrighted works without permission for AI training purposes is generally considered permissible. Therefore, copyrighted works that are publicly available on the internet can be used for AI training without obtaining permission. However, the datasets created for training cannot be widely distributed to the public unless the copyright holders of those works have explicitly granted permission for such use. This restriction follows from the basic principles of copyright.

From the perspective of AI developers, this means that even if they have conducted AI training within the legal limits set by copyright law, they cannot freely distribute the datasets created for that training. AI vendors have no reason to obtain permissions from copyright holders for uses of copyrighted material beyond training purposes, and copyright holders have no incentive to consider any concessions specifically for AI training. This suggests that there is unlikely to be significant demand for completely free and open datasets.

Not only in Japan but eventually in many jurisdictions, the reproduction of copyrighted works without permission for AI training purposes will likely be recognized as lawful. At the same time, making all the data used for such training completely free and open is unlikely to be permitted. However, if all the necessary data and the complete information required to create the dataset are made publicly available, then in principle others could conduct equivalent AI training.

The current OSAID reflects this understanding and seems to strike a reasonable balance within the framework of global intellectual property rights. I think adding more specific language could be helpful, but so far, I haven’t come up with the right wording.