I would like to bring to your attention an announcement from IBM about their contributions to the LF AI & Data Foundation (they gave a presentation about this at the last TAC meeting):
In particular, of the three projects that IBM has contributed, I would like to highlight two that are relevant to the OSAID:
Docling: it streamlines the process of turning unstructured documents into JSON and Markdown files that are easy for large language models (LLMs) and other foundation models to digest (see the first sketch after this list).
Data Prep Kit: it helps clean, transform, and enrich unstructured data for pre-training, fine-tuning, and RAG use cases (see the second sketch below).
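
For anyone curious what the Docling workflow looks like in practice, here is a minimal sketch based on the project's published quick-start; the input path is a placeholder, so check the current docs before relying on the exact names:

```python
from docling.document_converter import DocumentConverter

# Convert a source document (PDF, DOCX, HTML, ...) into Docling's
# unified document representation. "report.pdf" is a placeholder path.
converter = DocumentConverter()
result = converter.convert("report.pdf")

# Export to the LLM-friendly formats mentioned above.
markdown_text = result.document.export_to_markdown()  # Markdown string
json_dict = result.document.export_to_dict()          # JSON-serializable dict
```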
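
And to make the "clean and filter" step concrete without guessing at Data Prep Kit's actual API, here is a plain-Python sketch of the kind of transform it packages (length and exact-duplicate filtering over a batch of documents); the function and parameter names here are hypothetical, not DPK's:

```python
import hashlib

def clean_batch(docs, min_chars=200):
    """Hypothetical illustration of a pre-training data filter:
    drop documents that are too short or exact duplicates."""
    seen = set()
    kept = []
    for text in docs:
        text = text.strip()
        if len(text) < min_chars:
            continue  # too short to be useful training data
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        kept.append(text)
    return kept
```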
These projects are important because they might demonstrate how IBM processed and filtered the data used to train their Granite models, for example. As you know, the OSAID requires:
“The complete source code used to train and run the system. The Code shall represent the full specification of how the data was processed and filtered, and how the training was done. Code shall be made available under OSI-approved licenses.”
IBM has also published an article describing their training data:
I believe this is encouraging. What are your thoughts on Granite?
Granite 4.0 is being positioned as an enterprise-ready alternative to conventional transformer-based models, with particular emphasis on agentic AI tasks such as instruction following, function calling, and retrieval-augmented generation (RAG). The models are released under the Apache 2.0 license, cryptographically signed for authenticity, and stand out as the first open language model family certified under ISO 42001, an international standard for AI governance and transparency.
More information about its license and how it qualifies as a Class III Open Model is available here, although the page does not mention the OSAID: