Friends,
As I’ve been researching many different efforts that claim various levels of Open Source AI, I’ve noticed a pattern around the types of training datasets used for a model.
This pattern has led me to a possible solution to the debates around the Open Source AI definition’s commitment to requiring Open Data.
Summary
This proposal is for an exception in the OSAI definition that allows some or all of a training dataset to be withheld under specific conditions:
when the dataset cannot reasonably be released for reasons outside of the model creators’ control, and the model creators are otherwise acting with integrity in sourcing the data and toward Opening everything else, the release can claim/receive the “OSAI D-” designation to indicate the exception.
(I’m aware the actual terms for this would exist in licenses designed to be compliant with the OSAID.
For introducing this idea, I am using a D- designator grounded in the definition itself; these approaches can be improved, natch.)
Pattern to nuance
So far the discussion around this binary condition — with or without data — has also been binary.
My intention is to introduce a bit of nuance to show there seem to be five states rather than two — with one state existing in two quadrants because IP rights and provenance are obscured when kept closed (the Schrödinger’s cat of IP rights).
To start I’ll define the pattern I’ve observed, then diagram the pattern onto an XY quadrant, and finally integrate the proposed solution into the diagram.
The four quadrants in the diagram are M, N, O, and P, which are further defined at the end.
Observed pattern with five states
This is the core of the observed pattern, with the state number and quadrant association in brackets like this: [#,Q]
- For every potentially OSAI-compliant model, the training dataset either is released or not.
- Is Released [1,N] — those who release the training dataset know they are acting with integrity to the existing Open Source definition (OSD), as well as having the rights to license the data.
- Is Not Released — those who do not release the training dataset fall into one of two groups:
- No IP Rights [2,P] — they would release the dataset if they could, but they do not have the rights to release the data in one or more jurisdictions. They claim or appear to be acting with high integrity to overriding conditions or law, rather than selfishness.
- Have IP Rights Without Integrity [3,M] — they have the rights to release the dataset but do not choose to do so, most likely for one of these reasons:
- Commercial or research competition.
- They cannot prove the provenance of the dataset to be sure they have the rights to release it.
- Variant — they know/believe the provenance may not/does not give them the rights to release the dataset, and they wish to mitigate that risk through obscurity.
- Non-OSAI models and systems show low integrity toward the Open ecosystem overall, with two variants to fit into quadrants:
- No or Unclear IP Rights — [4,O] quadrant for fully-closed.
- Full IP Rights — [5,M] quadrant, choosing not to Openly license the software.
Here is the base diagram for these quadrant definitions:
(Diagram: Organization relationship to data)
Here is the full diagram with the proposed D-/D+ solution mapped to the quadrants:
Quadrants defined
Restated, the diagram means:
The “M” quadrant is where the model creators have the rights to license the software and full training data, but choose not to do so.
They may call properly licensed software and content “Open Source”, but their AI is closed and cannot be referred to as “Open Source AI”.
The “N” quadrant is where the model creators have the rights and integrity to release the full training data and all other assets. With a proper license, they are OSAI D+ compliant.
The “O” quadrant is where the model creators have no rights or choose not to license the software or data, and this is the realm of all closed source AI.
The systems without proper licensing that misuse the term “Open Source” also have the option of properly licensing their software with an Open Source license (and thus moving to quadrant M).
The “P” quadrant is where the model creators would release the data if they had the rights to it, but for (verifiably?) legitimate reasons cannot.
If they are acting with high integrity toward the spirit and meaning of Open, and also properly license all assets except the training dataset, they are OSAI D- compliant.
NB: This diagram may be better as a matrix rather than a quadrant with XY ranges.
The ranges do not make direct sense where the conditions are binary.
However, I have an instinct there may be further nuance that creates some level of range along the X or Y axis, and the quadrant layout spurs that thinking.
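To make the quadrant logic concrete before applying it, here is a rough sketch of how the states and designations could be encoded. The field names (has_ip_rights, releases_dataset, openly_licensed, acts_with_integrity) are my own shorthand for this illustration, not terms from the OSAID, and the conditions are a simplification of the pattern above.

```python
from dataclasses import dataclass
from enum import Enum


class Quadrant(Enum):
    M = "rights held, chooses closed"
    N = "rights held, fully released (OSAI D+)"
    O = "no or unclear rights, fully closed"
    P = "no rights to the data, otherwise Open (OSAI D-)"


@dataclass
class Release:
    has_ip_rights: bool        # creators hold the rights to license the training data
    releases_dataset: bool     # the training dataset is published under an Open license
    openly_licensed: bool      # weights, code, and other assets carry proper Open licenses
    acts_with_integrity: bool  # sourcing and any withholding reasons are legitimate/verifiable


def classify(r: Release) -> Quadrant:
    """Map a release onto the M/N/O/P quadrants described above."""
    if r.has_ip_rights and r.releases_dataset and r.openly_licensed:
        return Quadrant.N   # state [1,N]: OSAI D+
    if not r.has_ip_rights and r.openly_licensed and r.acts_with_integrity:
        return Quadrant.P   # state [2,P]: eligible for the OSAI D- exception
    if r.has_ip_rights and not r.releases_dataset:
        return Quadrant.M   # states [3,M] and [5,M]: closed by choice
    return Quadrant.O       # state [4,O]: fully closed


# Example: a case where the data cannot be released but everything else is Open.
example = Release(has_ip_rights=False, releases_dataset=False,
                  openly_licensed=True, acts_with_integrity=True)
print(classify(example))  # Quadrant.P
```

Of course, a mapping like this is only as trustworthy as its inputs; the obscured-provenance variant in quadrant M is exactly the case where the inputs cannot be verified from the outside (the Schrödinger’s cat again).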
Applying the proposed model
Two examples that can map into this proposed model are covered in detail in my forum post, An open call to test OpenVLA.
The Open X-Embodiment Dataset clearly fits into the N quadrant, with a properly licensed dataset.
For the OpenVLA model, there is an argument that the creators are acting with high integrity but are unable to release the dataset.
Given the diligence put into the rest of the Opening work, it seems the model creators would release the dataset if they could.
It is eligible for the “D-” exception in the P quadrant.
It is arguably a great benefit to society for OpenVLA to advance via the effects of fully-Open Source AI development — regardless of access to the dataset, an Open Source ecosystem around this model could thrive.
In closing, a comparison
Would it be fair to compare the condition of “reasonable inability to release the dataset” to having an Open Source license for code that can only run on an IBM zSeries mainframe?
Nothing is certain, but isn’t there potential technical and cultural value to having that code as Open Source?
This proposal and the diagrams in editable SVG format are in this repo: