Hi everyone,
I would like to bring to your attention an announcement from IBM about their contributions to the LF AI & Data Foundation (they presented this at the last TAC meeting):
In particular, of the three projects that IBM has contributed, I would like to highlight the two that are relevant to the OSAID:
- Docling: streamlines the process of converting unstructured documents into JSON and Markdown files that are easy for large language models (LLMs) and other foundation models to digest.
- Data Prep Kit: helps clean, transform, and enrich unstructured data for pre-training, fine-tuning, and RAG use cases.
These projects are important because they may demonstrate, for example, how IBM processed and filtered the data used to train its Granite models. As you know, the OSAID requires:
“The complete source code used to train and run the system. The Code shall represent the full specification of how the data was processed and filtered, and how the training was done. Code shall be made available under OSI-approved licenses.”
IBM has also published an article describing their training data:
I believe this is encouraging. What are your thoughts on Granite?