The Missing Third Leg: Training Data Excluded from Open Source AI Definition by [Co-]Design

Thank you for merging and comparing these ideas, and I <3 @samj for the OCI comparison. Yup, this problem has been around for a while, but it never got much attention until the “discovery” of Big Data…

One nuance I see in these patterns highlights the conflicting challenges we have:

  • OSAI components are licensed in layers, especially the data (explanation to follow)
  • Strict adherence to the definition for each layer, every single time, versus…
  • Vernacular usage that will continue to ignore cases where one of the deeper layers is non-Open

Once you get inside a model, the licensing often turns out to be complex, because it depends on the specific objects in the original dataset. For example, the JPEG of an ancient porcelain vase used in a dataset may be under the copyright of a museum that didn’t license the image … except for the acceptable use policy on the museum website that was in effect in 2019 when Common Crawl came through. Now there are two blocks to the data objects being actually Open: the lack of a license from the museum and the terms of use for Common Crawl.

Here is the just-released specific example I’m thinking of, which claims to be Open Source but is so only for the part the creators could license:

Researchers from ByteDance have released a multimodal dataset designed for complex mathematical reasoning (article). A glance at the arXiv paper “InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning” and a read of the model card on HuggingFace quickly reveal these layers:

  1. The InfiMM-WebMath-40B multimodal dataset is under the odc-by license (Open Data Commons Attribution License), which is OSD-compliant.
  2. However, the actual documents used (85 million image URLs and 24 million web documents) came from Common Crawl data snapshots from 2019-2023 (with filtering, etc.). Common Crawl data seems to be under their terms of use, which is definitely not OSD-compliant.
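
To make the layering concrete, here is a minimal sketch of how one might record each layer's terms and ask which layers keep the whole stack from being Open. The `Layer` class, the `blocking_layers` helper, and the `infimm_stack` list are hypothetical names made up for illustration, and the compliance flags simply restate my reading above rather than any legal determination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """One licensing layer of a dataset or model (hypothetical structure)."""
    name: str
    terms: str
    osd_compliant: bool


def blocking_layers(layers):
    """Return the layers whose terms are not OSD-compliant."""
    return [layer for layer in layers if not layer.osd_compliant]


# Flags mirror the two layers described above; they are assumptions,
# not the result of a legal review.
infimm_stack = [
    Layer("InfiMM-WebMath-40B dataset", "odc-by", True),
    Layer("Common Crawl snapshots (2019-2023)", "Common Crawl terms of use", False),
]

blockers = blocking_layers(infimm_stack)
for layer in blockers:
    print(f"non-Open layer: {layer.name} ({layer.terms})")
print("Open at every layer:", not blockers)
```

The point of the sketch is that the answer changes depending on how deep you look: the top layer alone passes, while the full stack does not.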

Another simple example would be a model based on geospatial datasets where the actual satellite images are not available under an OSD-compliant license, but everything else is.

In this example, a model creator could release all the code, weights, other metadata, filters, and configurations, but the actual JPEGs would not be redistributable. Yet a person wanting to recreate this model could obtain the images themselves directly from the source, which might even be at no cost but would also come without a license to redistribute.

This last example is based on one brought to me recently, and frankly, I’m stumped. The model creator can be the best Open Source actor in the world, ready to collaborate on everything, but since they don’t operate the satellite and have to source the images from entities with different needs and rules, what can they do to participate in the Open Source ecosystem?

If I release an otherwise OSAID-with-Open Data compliant model under these conditions, and I put in the README the exact steps to download the images from a .gov FTP server for zero additional cost, where can I land in this Open Source AI ecosystem?

Completely setting aside nefarious actors, it seems that having the dual-branding (like D+/D-) gives a way for model creators from research, academia, NGOs, and startups to participate in the OSAI ecosystem on a more equal footing, yes?
