A Journey toward defining Open Source AI: presentation at Open Source Summit Europe

anon18632855 · October 1, 2024, 8:17pm

Right, practitioners are not typically modifying the data in its system of record, rather transforming (selecting, filtering, de-duplicating, segmenting, tokenising, balancing, normalising, etc.) the “source” before “compiling” it to produce a “binary” (model).

How? By voting? By relying on the MOF? Consensus? Coin toss? It doesn’t matter because you accept “data is essential for […] studying the system” — one of the four essential freedoms — so data must be required by OSAID.

It is the only form for making many/most modifications, so it must be the preferred form, or you’d be placing limits on the freedom to modify as well.

Instead, the data information and code requirements allow Open Source AI systems to be forked by third-party AI builders downstream using the same information as the original developers.

This is internally inconsistent: the “same information as the original developers” IS the data, not metadata (data about the data aka “data information”).

@Shamar nailed it in that “as long as the training and testing data are available to the public under the same terms that allowed their usage from the builders in the first place, we can still count the AI system as “Open Source AI”, since the 4 freedoms are still granted even if the builders cannot directly distribute them.”

These forks could include removing non-public or non-open data from the training dataset, in order to retrain a new Open Source AI system on fully public or open data.

You cannot remove non-public or non-open data from the training dataset — which would be a very common form of exercise of the freedom to modify — if you don’t have the training dataset.

Per @thesteve0 in Model Weights is not enough for Open Source AI, prove it with a “demonstration showing that both techniques produce the same model weights.”

The vote was misinterpreted and demands the data. So does the straw poll I’m running asking “What is the “preferred form” in which a practitioner would modify a model?” for which 100% of the votes are going to Data rather than Model.

Perhaps this was pre-posted before the developments of the last days, or was intended to reflect the state of affairs as at the Open Source Summit Europe on 16-18 September, but if not it feels like we’re going forward internally only to go backwards externally.