The ZOOOM approach on Asset Openness

gvlx · October 6, 2024, 3:21pm

Be careful: the text needs to be read critically, as some of its passages, including some you have highlighted, are not really correct.

For instance:

Object code represents the complete translation of source code into an executable binary. [p.9]

This is incorrect: modern compilers implement optimization techniques that can discard parts of the source code, leaving them out of the object code when they are deemed of no use on the target architecture. This breaks the direct correspondence between source and object, something the text seems to hint at but does not detail:

Object code is an executable form of source code that has been optimised by the compiler and linker in a way that allows it to run on a particular computer architecture. [p.9]
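As a small analogy for the broken correspondence between source and compiled output, here is a minimal sketch. It uses CPython's bytecode compiler as a stand-in for a native compiler and linker (my assumption, not something the paper discusses), and the variable name `seconds_per_day` is purely illustrative. The source contains two multiplications, yet the compiled code contains none: the compiler folds them into a single constant.

```python
import dis

# Source text with two explicit multiplications.
source = "seconds_per_day = 60 * 60 * 24\n"

# Compile to a CPython code object (bytecode, not native object code).
code = compile(source, "<example>", "exec")

# Collect the names of the emitted instructions.
opnames = [ins.opname for ins in dis.get_instructions(code)]

# Arithmetic instructions are named BINARY_* in CPython bytecode.
has_arithmetic = any(name.startswith("BINARY_") for name in opnames)

print(opnames)
print("arithmetic instructions present:", has_arithmetic)
print("folded constant present:", 86400 in code.co_consts)
```

The multiplications written in the source cannot be recovered from the compiled form; only their result survives.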

As for your conclusion about the text on page 7 on the CDSM Directive, you should have written

to remind us why datasets, in some cases, cannot be shared

as not all TDM practices fall under the CDSM or similar legislation. Furthermore, contrary to the notion that data is not an integral part of the description of a model, the authors even indicate that:

However, if we accept that there is indeed a symbiotic relationship between some forms of memorisation and generalisation, then we can argue that memorisation is essential to the purpose of text and data mining, which is to produce a working model that can generate information. [p.8]

The part of the text I find crucial to our discussion is found in their final arguments:

We suggest that the principles of transparency, enablement and reproducibility are taken as the key concepts to unlock the beneficial effects for open source for AI. [p.14]

The first two principles, transparency and enablement, are easily transposed onto the Free Software freedoms stated in the current version of the OSAID, which, as a consequence, offer:

transparency: the code and data required to train and use the model are available to be copied, used and studied
enablement: the code and data required to train and use the model can be modified, and shared in their original or modified form

The reproducibility principle appears as a consequence of freedom 0, “The freedom to run the program as you wish, for any purpose”: there must be a guarantee that an executable compiled from the source does exactly what the delivered executable does, so as to be certain that no extra (non-free) feature was added.

This is no minor detail: there are known cases of delivered executables with features (or anti-features) that are not present in what was deemed to be the complete Open Source code of that binary.

The reproducibility principle is also praised as one of the main advantages of Open Source and has become a cornerstone of current best practices in security and research.
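To make the reproducibility check concrete, here is a minimal sketch of one common way to test it: rebuild the artifact from the published source, then compare its cryptographic digest with that of the delivered artifact. The function names `sha256_of` and `is_reproduced` are hypothetical, and the sketch assumes the build itself is deterministic (same toolchain, flags and environment), which is a prerequisite in practice and not something this code can establish.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_reproduced(delivered: Path, rebuilt: Path) -> bool:
    """True when the delivered and locally rebuilt artifacts are bit-identical."""
    return sha256_of(delivered) == sha256_of(rebuilt)
```

A mismatch does not say what differs, only that the delivered artifact is not the one the published inputs produce, which is exactly the signal needed to suspect an added feature or anti-feature.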