Draft v.0.0.9 of the Open Source AI Definition is available for comments

IBM’s Granite model is not Open Source.

There, I said it, and you can quote me on that. The problem is that their explanation doesn’t even describe the training data, let alone make it available:

All the models were trained on data that was collected in adherence with IBM’s AI ethics principles and with the IBM legal team’s guidance for trustworthy enterprise use. These Granite Code models are released today under the Apache 2.0 license.

Without the training data — analogous to source code in the AI context, while models are the resulting binaries — you do not enjoy unfettered freedom to use, study, modify, and share the model, so it cannot be classed as Open Source: "The source code must be the preferred form in which a programmer [practitioner] would modify the program [model]."

The preferred form for model modification for a practitioner is the training data (plus any instructions and/or code that manipulate/filter/etc. it prior to training, so the system sees the same set as input), and while there are limited modifications one can do without the data (e.g., fine-tuning), you absolutely cannot "do whatever [you] want with it". As but one example (one's enough), how will you filter training examples containing the word "nazi" from the source with only the weights? Open Source does not and must not limit the modifications you can make. RAG isn't even modifying the system; rather, it exploits the model verbatim and tweaks its inputs. Neither is using a gatekeeper to filter "nazi" references in the output. This is an extreme but real example: Microsoft shuts down AI chatbot after it turned into a Nazi.
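
To make that concrete, here's a minimal sketch (the file names and the "text" field are hypothetical, not anything IBM has published) of dropping offending examples from the corpus before retraining; trivial with the data, impossible with only the weights:

```python
import json

BANNED = {"nazi"}  # terms we want removed from the training corpus

def is_clean(example: dict) -> bool:
    """True if the example's text contains none of the banned terms."""
    text = example.get("text", "").lower()
    return not any(term in text for term in BANNED)

# Read the published corpus and write a filtered copy for retraining.
with open("training_data.jsonl") as src, open("training_data.filtered.jsonl", "w") as dst:
    for line in src:
        example = json.loads(line)
        if is_clean(example):
            dst.write(json.dumps(example) + "\n")
```

The filtered corpus would then go back through the same pre-processing and training pipeline, so the retrained system never sees the removed examples.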

Rewinding a bit:

No, and I’m yet to take a strong position on whether it’s critical here either. In general, I don’t think we need to expand the definition of Open Source, but we do need to meet it. The MOF authors seem to think it’s useful, arguing that the model parameters’ “format should be compatible with deep learning frameworks, such as TensorFlow, Keras, PyTorch, or the framework-independent ONNX file format” (a quick sketch of what that looks like in practice follows the quote below), but they used a “should” rather than a “must” requirement level. The existing OSD gives us clear guidance here:

The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
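
As for the format point, it's easy to satisfy in practice; a hedged sketch (toy model, hypothetical output path, not how any particular producer does it) of serialising PyTorch weights to the framework-independent ONNX format the MOF mentions:

```python
import torch
import torch.nn as nn

# Toy stand-in for a real model; the point is only the portable serialisation.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
model.eval()

dummy_input = torch.randn(1, 16)  # example input the exporter traces with
torch.onnx.export(
    model,
    dummy_input,
    "toy_model.onnx",             # hypothetical output path
    input_names=["features"],
    output_names=["logits"],
)
```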

The MOF authors also think IBM erred in releasing it under the Apache 2.0 license, by the way, but that's another implementation issue that can be resolved once we've moved the big rocks:

To date, model producers have been releasing model parameters (i.e., weights and biases) using an open-source license, such as Apache 2.0 and MIT, even though model parameters are not compatible with OSS licenses. Since model parameters are in fact data, producers should use an open data license, such as CDLA-Permissive-2.0. Although licenses designed for OSS are permissive and indemnify the developer from liability, open data licenses are better suited to data-specific considerations such as privacy, ethics, and data rights.

The MOF was introduced at the start of townhall 5 and referenced in virtually every one since, including the latest on Friday in which @stefano gave a good explanation of its relevance. It’s been relied upon heavily in the creation of the draft and especially the checklist, yet its own clear position on what is required in terms of openness and completeness seems to have been missed or ignored. Sure, it references “reproducibility”, but I’m not sure you can get to full transparency (to study) and extensibility (to modify) without picking it up on the way… maybe by providing the data but omitting pre-processing code?

To achieve full transparency, reproducibility, and extensibility, we argue that model producers must go beyond just releasing their model and the trained weights and biases, which is currently the norm. Instead, they should include all artifacts of their work, including datasets for training, validation, testing, and benchmarking, as well as detailed documentation, such as research papers, model cards, data cards, and any usage documentation.
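
Read literally, that's close to a checklist you could automate; a minimal sketch (the directory layout here is entirely hypothetical, not any actual release) of verifying a release ships the artifacts the MOF calls for:

```python
from pathlib import Path

# Hypothetical layout: none of these paths refer to any actual release.
REQUIRED = [
    "data/train",
    "data/validation",
    "data/test",
    "data/benchmarks",
    "preprocessing",   # code that filters/manipulates the data before training
    "training",        # training code and configuration
    "weights",         # the resulting parameters
    "MODEL_CARD.md",
    "DATA_CARD.md",
]

def missing_artifacts(release_dir: str) -> list[str]:
    """Return the required artifacts that are absent from the release."""
    root = Path(release_dir)
    return [item for item in REQUIRED if not (root / item).exists()]

if __name__ == "__main__":
    print(missing_artifacts("example-release"))
```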

This is not entirely unlike the faulty voting data issue, in that the MOF has been used to make and/or justify the decision(s), but it's as if the quoted support isn't actually in the citation given. There may well be differences between MOF Class I and what it means to be "Open Source", and we need to find and refine them. If that's you, maybe you could ask the LF authors what they meant?