Should testing data be made available to *use* an AI system?

One area of disagreement that emerged in this week’s Llama 2 work group meeting (slides forthcoming) is whether testing data sets are technically required to use an AI system.
What do you think? Is it necessary to make testing data sets available to exercise the freedom to use AI? What does available mean to you in this case?

This is an odd question, especially absent the context. I’m going to answer a different question first (at least from my perspective):

Q: Is source code necessary to use an open source binary?
A: No, assuming that the binary is exactly compatible with the system that you wish to run it on.

Now, that makes two big assumptions:

  1. The source code in question is used to generate the open source binary.
  2. The source code in question is not somehow read/used/required by the open source binary as part of its normal functionality.

Point 1 is more of a semantic concern, but I raise it because it will be relevant when we extrapolate this to your AI question.
Point 2 seems silly at first; most programs do not read/use/require their source code to function, but some do. Interpreted languages are one example, and many programs also require “data” files to be read in order to work.

Accordingly, let’s reframe the question as follows:

Q. Are data files which are read/loaded/used for specific functionality during the execution of an open source binary necessary to use that open source binary?
A. Yes, if you want that specific functionality to work.

We get a different answer with this additional nuance, because we’re not thinking about source code that is only used once, but instead, thinking about data files which are used every time specific functionality from the binary is accessed.
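To make that concrete, here is a minimal sketch (the file names and the translations example are purely hypothetical) of a program whose localized-greeting feature depends on a data file being read at runtime:

```python
# greet.py -- hypothetical example of functionality that depends on a
# data file at runtime. Without "translations.json", the localized
# greetings (one specific feature of the program) do not work.
import json


def greet(name: str, lang: str) -> str:
    # The data file is read every time the localized-greeting feature is
    # exercised, not just once when the program was built.
    with open("translations.json") as f:
        translations = json.load(f)
    hello = translations.get(lang, {}).get("hello", "Hello")
    return hello + ", " + name


if __name__ == "__main__":
    print(greet("world", "fr"))
```

Distribute the program without translations.json and that particular functionality fails, which is exactly the Yes above.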

Testing data is still different though. I would argue that most programs only use testing data to verify correct output from a program, and that the programs do not directly depend on the testing data to enable specific functionality.

So, we reframe the question once again:

Q. Are testing data files, which are only used for testing that an open source binary operates properly, necessary to use an open source binary?
A. No.

We’ve got a different answer!

Now, the binary might have bugs and it might be exceedingly difficult to identify those bugs without the testing data files, but the binary does what it does and the absence/presence of those testing data files does not impact functionality. A user can do all the things the binary is capable of without them.
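Continuing the same hypothetical example, a test harness only touches the testing data when the tests themselves are run:

```python
# test_greet.py -- hypothetical pytest-style tests for greet.py above.
# Only the tests read the testing data; deleting
# "expected_greetings.json" breaks the tests, not the program itself.
import json

from greet import greet  # the program under test (sketch above)


def test_greetings_match_expected_output():
    with open("expected_greetings.json") as f:
        expected = json.load(f)  # e.g. {"fr": "Bonjour, world", ...}
    for lang, greeting in expected.items():
        assert greet("world", lang) == greeting
```

Without the expected-output file you lose the ability to verify the program, not the ability to use it.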

Still with me? :laughing:

Back to your first question (paraphrased a bit):

Q: Are testing data sets technically required to use an AI system?

In the context of an AI system, we need to know if the testing data sets are:

A. only used for testing the validity of the AI system
B. read/loaded/used for specific functionality during the execution of the AI system

If A, the answer is probably No. If B, the answer is probably Yes.

It is plausible that some AI systems fall under A, while others fall under B.

There is one additional factor, which is whether the “testing data sets” are necessary (and accordingly used) to generate the AI system (our very complex open source binary). If they are, there is an argument that “use” in an open source context includes “use” on your system, which can be different from the system that the upstream has/prefers/provides a binary for. It could differ in small ways (e.g. the version of the C library or OS kernel) or in bigger ways (e.g. CPU architecture, GPU type). Open source empowers the user to recompile/regenerate something from its necessary component parts.

But even this is philosophical, because I can open source code that does not compile (either due to bugs or an incomplete implementation). That bad code is not useful, but it is open source by the existing definition. The trick here is that if I also distribute a binary which (because it exists) did compile from source at least once, “open source” means that I need to make the corresponding source code that produced that binary available. OSD Clause 2 says “The program must include source code…”, and I would argue that it implicitly means “The program must include the source code which corresponds exactly to the program…”

I realize this is lengthy and a bit windy, but without deeper context it’s the best I can do. I do think we need to be cautious not to reduce the challenge of defining what an Open Source AI system™ is to overly simplistic concepts that may not be universally true (or false).

We should ask:

  1. Is this “component thing” necessary to generate/regenerate the AI system? Can it be “rebuilt from source” without it and get the same functionality as the existing AI system?
  2. Is this “component thing” necessary for core runtime functionality of the AI system? Does it fail to run in normal scenarios without it?

It is arguable that if the answer to both 1 and 2 is no, then the availability of “component thing” is not necessary to exercise the freedom to use the AI system.

I would also argue that if the answer to 1 is no, and 2 is YES, the availability of “component thing” is necessary to exercise the freedom to use the AI system. A fine nuance here is that it is still possible that some of the sub-components of the larger AI system could be open source, even if the larger AI system would not be considered so.

And for completeness, I would argue that if the answer to 1 is YES, then the availability of “component thing” is necessary to exercise the freedom to use the AI system, because my runtime environment may be different, and I would need to recompile/regenerate the AI system to make it run for me.
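The three cases above collapse to a simple rule; here is a toy encoding of that logic (the names are mine, not anything from the draft):

```python
def component_needed_to_use(needed_to_regenerate: bool,
                            needed_at_runtime: bool) -> bool:
    # Availability is necessary to exercise the freedom to use the AI
    # system if the component is required either to regenerate it for a
    # different environment or for its core runtime functionality.
    return needed_to_regenerate or needed_at_runtime
```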

I hope this is helpful!

1 Like

Fair enough :slight_smile: The context is visible in the list of components published in the checklist section of draft 0.0.5. The question came up while setting up the working group doing the analysis of Llama 2. I thought it’d be valuable to share it here for a wider conversation.

If I followed your argument correctly, you’re writing about testing, validating, certifying and trusting (to some extent). Consider that the verb to Use in the question refers to the definition of an AI system: it means (following the OECD definition) letting the machine infer an output from an input.
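To ground that definition of “use”, here is a minimal inference sketch (the model name is a placeholder, and the transformers pipeline is just one common way to run inference): the machine produces an output from an input, and no testing data set is read at any point.

```python
# "Use" in the OECD sense: the system infers an output from an input.
# Only the model artifact and inference code are involved; the model
# name below is a placeholder.
from transformers import pipeline

generator = pipeline("text-generation", model="some-org/some-llm")
print(generator("The freedom to use an AI system means", max_new_tokens=20))
```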

This suggests that we can’t answer the question I asked in generic terms: there are too many “it depends” and the scenarios are not clear enough. I think we’re not yet ready to talk about what’s needed to validate, replicate, and trust; those important issues can wait for a later stage.

I’d say really no.

First, an AI program may be tested manually to some extent. Would everything the developer tries by hand have to be in the included set?

Also, this requirement doesn’t exist for non-AI open source software at all. If you test your program on some data, you don’t need to share that data for your program to be open source.

In fact, I’d argue you don’t even have to share things such as unit tests. You should, but you don’t have to, and if you don’t, what you do share is still open source.

2 Likes

Agree.

May be “appreciated”, but absolutely not required.

1 Like

@stefano I recommend we follow RFC 2119 when using words like “MUST”, “SHOULD”, or “MAY”.

From my understanding of this thread, people agree that the testing data SHOULD be made available in order to use the system, but not that it MUST.

1 Like

@nick, I disagree.

The document in question is a definition. The purpose should simply be to discriminate between a license or a model which is open source and one which isn’t.

What “SHOULD” happen, both in the common sense of the word and in the RFC 2119 sense of the word (which are rather well aligned, IMO), is an important topic, but I really think it doesn’t belong in this definition.

2 Likes

I disagree, too.
It would expand the meaning of the traditional term Open Source.

2 Likes

Testing data is not usually coupled with an AI model. For instance, if we have an LLM, we can use multiple benchmarking datasets to test its performance. These benchmark datasets are not coupled with the training process of the model. Besides, some test datasets do not publish their labels to prevent participants from cheating, such as the COCO dataset.
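A hedged sketch of that decoupling (dataset and model names are placeholders, and the label fields are assumed to be directly comparable): the same model can be scored against any number of benchmark datasets, none of which is needed to load or run it.

```python
# One model, several external benchmark datasets: the benchmarks only
# measure the model, they are not needed to run it. Names are placeholders.
from datasets import load_dataset
from transformers import pipeline

classifier = pipeline("text-classification", model="some-org/some-classifier")

for benchmark in ["benchmark-a", "benchmark-b"]:
    test_split = load_dataset(benchmark, split="test")
    correct = sum(
        classifier(example["text"])[0]["label"] == example["label"]
        for example in test_split
    )
    print(benchmark, "accuracy:", correct / len(test_split))
```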

Apart from training and testing, the validation dataset is indeed coupled with an AI model because, by convention, the model hyper-parameters are tuned on it.

So I think it makes little sense to require a test dataset, because:

  1. There is no fixed test dataset for a given AI model.
  2. The test dataset is not coupled with the model-creation process; it is independent of the model training process. If it is not, the model trainer has made a serious technical error.