quaid’s proposal here makes sense to me and addresses my biggest concerns with RC1.
The points in the “How we passed the AI conundrums” article seem to miss a key issue with RC1: it doesn’t appear to make any meaningful extension to what is already possible with existing open source licenses.
It seems we are aligned that there are four elements that together make up the entirety of an AI:
- Training data
- Data information - documentation on training data selection, pre-processing, etc.
- Code
- Parameters - model weights, configuration settings, etc.
It seems that:
- Data information and code are already easily covered by established open software licenses.
- RC1 only requires data information, code and model weights.
- Therefore, the only new element included in the RC1 definition is parameters - primarily model weights.
What is the purpose of releasing model weights without training data?
From the article, the suggestion is that (in combination with data information and code) this would let users train a new model using their own private data.
For any real-world usage (like the bone cancer example), it’s hard to imagine anyone using model weights from a third party without an external trust relationship (e.g. a paid contract). The weights have exactly the same transparency, trust, and security issues as downloading and running binary blobs built by individuals you don’t know.
So it’s hard to come up with a purpose for including the weights (without training data). Any responsible user who wants to use them for anything meaningful should be retraining the model on their own training data. The only use case I can think of is a sandboxed evaluation tool - but even then it’s totally unclear what the benefit or meaning of this blob of data being “open source” is.
If we follow the RC1 logic to its conclusion, then: model weights clearly aren’t “source”, and (even in combination with data information and code) they aren’t an artifact that could responsibly be put into use.
So why are model weights even included in the definition?
I think the answer is that if they are not included, the only things left are data information and code, and it’s clear these are already easily covered by established open software licenses. That would make an Open Source AI definition moot - it simply wouldn’t be needed.
**Conclusion**
This is the situation RC1 seems to be in now - it has expanded the definition beyond what is already possible in order to have a reason to exist, but the primary artifact it adds doesn’t seem to bring practical value or meaning.
So, to come full circle - if we want a meaningful Open Source AI definition, it needs to include open training data. Yes, this might mean that models that inherently rely on private data can’t be “Open Source AI”, but they can still have open source data information and code, which are the primary artifacts that have value to others.
As a reminder, for its first decade (or two) the whole idea of FOSS was considered by the vast majority of serious software users to be a completely impractical niche that would never take off. Yet nowadays it is present (to varying degrees) in pretty much every computer system and website.
I think the right choice here is to take a bold step with a meaningful definition, knowing that over time it has the potential to develop into something much more valuable to humanity.