The ZOOOM approach to Asset Openness

Hello,

I would like to take a small detour from the current discussions to show you another framework for understanding the issues at hand: Zooom4U

3Os and IP awareness raising for collaborative ecosystems

Open source software (OSS), open (source) hardware (OH/OSH) and open data (OD) are essential for a sustainable, trustworthy and sovereign industrial ecosystem. However, lack of competences in matching business models with appropriate licensing frameworks prevents unlocking the full potential of emerging technologies.

ZOOOM aims to raise awareness on the importance of intellectual property (IP) generation and management in collaborative innovation ecosystems which rely on these three key assets. Stay in the loop on latest updates, events and project outcomes!

Of particular interest to the OSAID is the following interactive resource, AI as a hybrid asset, which is based on their policy brief Open Source AI: Building Blocks for a Definition and goes into much more detail than our previous exchanges.

The authors (Ivo Emanuilov & Jutta Suksi) state the following:

In addition, we assess how intellectual property law treats hybrid intellectual property. We offer three building blocks for a future definition of Open Source AI, namely transparency, enablement and reproducibility:

  • Transparency: disclosure of details about the composition of training data sets, details about the data structures, architecture and algorithms, access to neural network weights etc.
  • Enablement: disclosure of sufficient details about the building of a model to enable anyone to rebuild the model, provided they have access to the required computational resources, as identified by the community developing the AI.
  • Reproducibility: development practices that create an independently-verifiable path from the training data to model inference.

These three building blocks should unlock the opportunities of open source in the domain of AI and we expect that they would also facilitate comprehension of AI as protected and licensable subject matter.

Reading through all this information, I’m forced to reassess my previous writings and take a much finer and stronger stance on data openness requirements for the OSAID.

But I’ll wait for your analysis.

2 Likes

Interesting paper, thanks for sharing it. It reinforces pretty much everything we’ve learned from this process so far (the authors of the paper participated in last year’s webinars, too).

A valuable quote that confirms that the source-binary equivalence cannot be ported to AI (data is not the “source” of AI). From page 9:

we argue that the case with the representation of a work as tokens and vectors is different from compiled object code

Another valuable nugget, to remind us why datasets cannot be shared:

The CDSM Directive, which governs the text and data mining exception, provides that reproductions and extractions may only be retained for so long as necessary for the purposes of text and data mining.

Also interesting are the answers given to the questions “Do I Need to Comply with the Licence Conditions When Using Open Source Code as Training Data?” and “Do I Need to Comply with the Licence Conditions When My Model Is Used?”

I read the conclusions as pretty much in line with the one we reached:

obviously, enablement would depend on technical criteria developed by the communities around projects for following the principles of open source

Which is what we’re completing now with the Open Source AI Definition: we have version 1.0, and we’ll have to watch this space and be humble enough to change the terms as we go. The reproducibility recommendation is a bit odd, as it sets a higher standard than classic Open Source software (I asked the authors for clarification; I’ll share their response if they reply).

Be careful: the text needs to be read critically, as some of the passages, including some you have highlighted, are not entirely correct.

For instance:

Object code represents the complete translation of source code into an executable binary. [p.9]

This is incorrect: modern compilers implement optimisation techniques that can ignore parts of the source code, leaving them out of the object code when they are deemed of no use on the target architecture. This breaks the direct correspondence between source and object code that the text seems to assume but does not detail:

Object code is an executable form of source code that has been optimised by the compiler and linker in a way that allows it to run on a particular computer architecture. [p.9]
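
To make this concrete, here is a minimal C sketch (the file and function names are made up): compiled with optimisation, for example gcc -O2 -c example.c, most compilers drop the unused static function from the object code entirely, so the object file is not a complete, line-by-line translation of the source.

```c
/* example.c (hypothetical): illustrates dead-code elimination. */
#include <stdio.h>

static int unused_helper(int x)
{
    return x * 42;      /* never called: typically removed from the object code at -O2 */
}

int main(void)
{
    printf("hello\n");  /* only this code path survives optimisation */
    return 0;
}
```

Dead-code elimination is only one such technique; inlining and constant folding similarly blur the one-to-one mapping between source and object code.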


As for your conclusion about the text on page 7 about the CDSM Directive, you should have written

to remind us why datasets, in some cases, cannot be shared

as not all TDM practices will fall under the CDSM or similar legislation. Furthermore, countering the notion that data is not an integral part of a description of a model, the authors even indicate that:

However, if we accept that there is indeed a symbiotic relationship between some forms of memorisation and generalisation, then we can argue that memorisation is essential to the purpose of text and data mining, which is to produce a working model that can generate information. [p.8]


The part of the text I find crucial to our discussion is found in their final arguments:

We suggest that the principles of transparency, enablement and reproducibility are taken as the key concepts to unlock the beneficial effects for open source for AI. [p.14]

The first two principles, transparency and enablement, are easily transposed onto the Free Software freedoms stated in the current version of the OSAID, which offer, as a consequence:

  • transparency: the code and data required to train and use the model are available to be copied, used and studied
  • enablement: the code and data required to train and use the model can be modified, and shared either in their original or in modified form

The reproducibility principle appears as a consequence of freedom 0, “The freedom to run the program as you wish, for any purpose”: there must be a guarantee that the executable compiled from the published source delivers exactly what the delivered executable does, so as to be certain no extra (non-free) feature was added.

This is no minor detail: there are known cases of delivered executables with features (or anti-features) that are not present in what was deemed to be the complete Open Source code of that binary.

The reproducibility principle is also praised as one of the main advantages of Open Source and has become a cornerstone of current best practices in security and research.
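
In practice, the baseline check behind reproducible builds is bit-for-bit identity between the binary as shipped and a binary rebuilt locally from the published source. A minimal C sketch of that comparison (the file names are assumptions; real projects also pin toolchains and build environments):

```c
/* compare_builds.c (hypothetical): byte-for-byte comparison of two binaries. */
#include <stdio.h>

int main(void)
{
    FILE *shipped = fopen("vendor-release.bin", "rb"); /* assumed name: binary as distributed */
    FILE *rebuilt = fopen("local-rebuild.bin", "rb");  /* assumed name: binary rebuilt from source */
    if (!shipped || !rebuilt) {
        perror("fopen");
        return 2;
    }

    int a, b;
    do {
        a = fgetc(shipped);
        b = fgetc(rebuilt);
        if (a != b) {               /* differing byte, or one file ends early */
            puts("NOT reproducible: binaries differ");
            return 1;
        }
    } while (a != EOF);

    puts("Reproducible: binaries are bit-identical");
    return 0;
}
```

If the two files differ, the extra (or missing) behaviour has to be accounted for; if they are identical, the source you can study is demonstrably the source of the binary you run.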

2 Likes

Indeed, and even for software it’s not a solved problem. Open Source is not an impediment to reproducible builds, but they are not its main objective.

I didn’t take the word “reproducibility” in its scientific meaning, because there is scientific debate about whether that’s even possible to achieve.

As you noticed, the authors have made simplifications about the role of modern compilers, so their text needs to be read knowing that they’re experts in European copyright speaking to EU regulators (it’s a policy brief).

I suspect the authors used the word “reproducibility” to make a general point (just like they did with the source/object code equivalence). I’m reading it as what the OSI FAQ calls the “right to fork”: give third parties the abilities (tools) and rights (legal terms) to learn how an AI has been built and build a similar one without having to start from scratch.