Should testing data be made available to *use* an AI system?

One area of disagreement that emerged in this week’s Llama 2 work group meeting (slides forthcoming) is whether testing data sets are technically required to use an AI system.
What do you think? Is it necessary to make testing data sets available to exercise the freedom to use AI? What does available mean to you in this case?

This is an odd question, especially absent the context. I’m going to answer a different question first (at least from my perspective):

Q: Is source code necessary to use an open source binary?
A: No, assuming that the binary is exactly compatible with the system that you wish to run it on.

Now, that makes two big assumptions:

  1. The source code in question is used to generate the open source binary.
  2. The source code in question is not somehow read/used/required by the open source binary as part of its normal functionality.

1 is more of a semantic concern, but I raise it because it will be relevant when we extrapolate this to your AI question.
2 seems silly at first: most programs do not read/use/require their source code to function, but some do. Interpreted languages are one example, and many programs require “data” files to be read in order to work.
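To make that concrete, here is a minimal Python sketch (the file name and rule format are invented for illustration) of a program whose specific functionality depends on a data file being present at runtime, as opposed to source code, which is only needed to produce the binary in the first place:

```python
import json

# "rules.json" is a hypothetical data file: without it, this specific
# functionality simply cannot run, no matter how the program was built.
def load_rules(path="rules.json"):
    with open(path) as f:
        return json.load(f)

def classify(record, rules):
    # The behaviour here is driven entirely by the contents of the data file.
    for rule in rules:
        if record.get(rule["field"]) == rule["value"]:
            return rule["label"]
    return "unknown"

if __name__ == "__main__":
    rules = load_rules()  # fails at runtime if rules.json is absent
    print(classify({"colour": "black"}, rules))
```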

Accordingly, let’s reframe the question like this:

Q. Are data files which are read/loaded/used for specific functionality during the execution of an open source binary necessary to use that open source binary?
A. Yes, if you want that specific functionality to work.

We get a different answer with this additional nuance, because we’re not thinking about source code that is only used once, but instead, thinking about data files which are used every time specific functionality from the binary is accessed.

Testing data is still different though. I would argue that most programs only use testing data to verify correct output from a program, and that the programs do not directly depend on the testing data to enable specific functionality.
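A minimal sketch of that distinction (function and file names are made up): the program’s functionality never touches the testing data; only the test harness reads it.

```python
def word_count(text: str) -> int:
    # The program's actual functionality: no testing data involved.
    return len(text.split())

# --- test code, typically shipped separately and never needed at runtime ---
def test_word_count():
    # "tests/expected_counts.txt" is a hypothetical testing-data file: it is
    # read here to verify correctness, never by word_count() itself.
    with open("tests/expected_counts.txt") as f:
        for line in f:
            expected, sample = line.split("\t", 1)
            assert word_count(sample) == int(expected)
```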

So, we reframe the question once again:

Q. Are testing data files, which are only used for testing that an open source binary operates properly, necessary to use an open source binary?
A. No.

We’ve got a different answer!

Now, the binary might have bugs and it might be exceedingly difficult to identify those bugs without the testing data files, but the binary does what it does and the absence/presence of those testing data files does not impact functionality. A user can do all the things the binary is capable of without them.

Still with me? :laughing:

Back to your first question (paraphrased a bit):

Q: Are testing data sets technically required to use an AI system?

In the context of an AI system, we need to know if the testing data sets are:

A. only used for testing the validity of the AI system
B. read/loaded/used for specific functionality during the execution of the AI system

If A, the answer is probably No. If B, the answer is probably Yes.

I can imagine that it is plausible that some AI systems fall under A, while others fall under B.
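A rough sketch of the two cases, with invented names and a toy interface, just to make the distinction concrete:

```python
# Case A: the dataset is consulted only by an offline evaluation script;
# inference works perfectly well without it.
def evaluate(model, test_pairs):
    correct = sum(model.predict(x) == y for x, y in test_pairs)
    return correct / len(test_pairs)

# Case B: the dataset is consulted at inference time (a retrieval-style
# dependency), so the system cannot answer queries without it.
class RetrievalModel:
    def __init__(self, reference_examples):
        # list of (value, label) pairs, required for every single prediction
        self.reference = reference_examples

    def predict(self, x):
        # toy nearest-neighbour lookup over the stored examples
        value, label = min(self.reference, key=lambda pair: abs(pair[0] - x))
        return label
```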

There is one additional factor, which is whether the “testing data sets” are necessary (and accordingly used) to generate the AI system (our very complex open source binary). If they are, there is an argument that “use” in an open source context includes “use” on your system, which can be different from the system that the upstream has/prefers/provides a binary for. It could be different in small ways (e.g. version of the C library or OS kernel), or bigger ways (e.g. CPU architecture, GPU type). Open source empowers the user to be able to recompile/regenerate something from its necessary component parts.

But even this is philosophical, because I can open source code that does not compile (either due to bugs or incomplete implementation). That bad code is not useful, but it is open source, by the existing definition. The trick here is that if I also distribute a binary which (because it exists) did compile from source at least once, “open source” means that I need to make the corresponding source code that made that binary available. The OSD Clause 2 says “The program must include source code…”, and I would argue that it implicitly means “The program must include the source code which corresponds exactly to the program…”

I realize this is lengthy and a bit windy, but without deeper context, it’s the best I can do. I do think we need to be cautious not to try to break this challenge of defining what an Open Source AI system ™ is down into overly simplistic concepts that may not be universally true (or false).

We should ask:

  1. Is this “component thing” necessary to generate/regenerate the AI system? Can it be “rebuilt from source” without it and get the same functionality as the existing AI system?
  2. Is this “component thing” necessary for core runtime functionality of the AI system? Does it fail to run in normal scenarios without it?

It is arguable that if the answer to both 1 and 2 is no, then the availability of “component thing” is not necessary to exercise the freedom to use the AI system.

I would also argue that if the answer to 1 is no, and 2 is YES, the availability of “component thing” is necessary to exercise the freedom to use the AI system. A fine nuance here is that it is still possible that some of the sub-components of the larger AI system could be open source, even if the larger AI system would not be considered so.

And for completeness, I would argue that if the answer to 1 is YES, then the availability of “component thing” is necessary to exercise the freedom to use the AI system, because my runtime environment may be different, and I would need to recompile/regenerate the AI system to make it run for me.

I hope this is helpful!

1 Like

Fair enough :slight_smile: The context is visible in the list of components published in the checklist section of draft 0.0.5. The question came up while setting up the working group doing the analysis of Llama 2. I thought it’d be valuable to share it here for a wider conversation.

If I followed your argument correctly, you’re writing about testing, validating, certifying and trusting (to some extent). Consider that the verb “to use” in the question refers to the definition of an AI system: following the OECD definition, it means letting the machine infer an output from an input.

This suggests that we can’t answer the question I asked in generic terms because there are too many “it depends”; the scenarios are not clear enough. I think we should leave what’s needed to validate, replicate, trust and the other important issues to a later stage.

I’d say really no.

First, an AI program may be tested manually to some extent. Would everything the developer tries have to be included in the testing data set?

Also, this requirement doesn’t exist for non-AI open source software at all. If you test your program on some data, you don’t need to share that data for your program to be open source.

In fact, I’d argue you don’t even have to share things such as unit tests. You should, but you don’t have to, and if you don’t, what you do share is still open source.

2 Likes

Agree.

May be “appreciated”, but absolutely not required.

1 Like

@stefano I recommend we follow RFC 2119 when using words like “MUST”, “SHOULD”, or “MAY”.

From my understanding of this thread, people agree that the testing data SHOULD be made available to use, but not MUST.

1 Like

@nick, I disagree.

The document in question is a definition. Its purpose should simply be to discriminate between a license or a model which is open source and one which isn’t.

What “SHOULD” happen, both in the common sense of the word and in the RFC 2119 sense of the word (which are rather well aligned, IMO), is an important topic, but I really think it doesn’t belong in this definition.

3 Likes

I disagree, too.
It would expand the meaning of the traditional term Open Source.

2 Likes

Testing data is not usually coupled with an AI model. For instance, if we have an LLM, we can use multiple benchmarking datasets to test its performance. These benchmark datasets are not coupled with the training process of the model. Besides, some test datasets do not publish their labels to prevent participants from cheating, such as the COCO dataset.

Apart from training and testing, the validation dataset is indeed coupled with an AI model because, by convention, the model hyper-parameters are tuned on it.

So I think it makes little sense to require a test dataset, because:

  1. There is no fixed test dataset for a given AI model.
  2. The test dataset is not coupled with the model creation process – it is independent of the model training process, as sketched below. If not, the model trainer has made a serious technical error.
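To make that convention concrete, here is a minimal scikit-learn sketch (any classifier and any sensible split ratios would do) of how the three splits are typically used; note that the final test split could be swapped for any independent benchmark dataset:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

def fit_with_splits(X, y):
    # Hold out a test set first, then carve a validation set out of the rest.
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

    # Tune a hyper-parameter on the validation split only.
    best = max(
        (LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train) for c in (0.01, 0.1, 1.0, 10.0)),
        key=lambda m: m.score(X_val, y_val),
    )

    # The test split is touched exactly once, for the final report.
    return best, best.score(X_test, y_test)
```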

(Reopening this topic, as it is becoming increasingly relevant to current discussions: Draft v.0.0.9 of the Open Source AI Definition is available for comments - #15 by Shamar )

I beg to disagree.

If you want to validate the claims of a particular AI system (as you would if you have the freedom to study it), you will absolutely need all of the original data that was used to test it.

It does not matter that you can create a new test dataset which could invalidate the original claims; you need to test the original conclusions against the original (train and test) data.

As with any other Open Source product, most people are only users; some will compile it; a few will study it and develop new features or correct bugs.

You need to have access to all parts to be able to exercise your four freedoms.

2 Likes

Clarification about the missing freedom from the latest comment:

If you want to change a particular AI system (as you would if you have the freedom to modify it), you will absolutely need all of the original data that was used to train it.

An AI system may also be modified by retraining the system (feeding new training data into the model), but that is another type of modification.

Both types of modification are required for an Open Source AI system.

We digress but this is relevant to the discussions about data access in general: if someone can access the required data somewhere (e.g., in Japan) without legal/network/storage/etc. impediments, then my freedoms as an Open Source user — who wouldn’t know how to exploit the data directly anyway — are at least partially protected. Conversely, we need not set the impossible standard that all datasets be readily available to every user everywhere. Jurisdictions are already competing to host AI development (e.g., EU vs US).

In certain contexts, providing testing datasets would be required, OR would be required IF training data is not (i.e., testing data can occasionally be a substitute for training data). For instance:

  • Regulated Industries: In fields like healthcare, medical devices, or factory automation, it is necessary to demonstrate compliance with safety requirements and regulations. Testing datasets may provide the necessary transparency to show that the system behaves according to the manufacturer’s claims and passes necessary audits (i.e., produces compliant outputs given certain inputs).
  • Data Return Systems: If a system’s output includes returning verbatim data, like a virtual psychologist quoting the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) or a virtual doctor referencing a Prescriber’s Digital Reference (PDR) entry, the testing datasets may be required for verification of proper function and compliance. As another recent example, retrieving verbatim quotes from transcripts with an LLM is risky given hallucination (which is one of several reasons why I didn’t request attribution).
  • Data-Dependent Algorithms: Algorithms may need to process some or all of the “source” data at runtime rather than “learning” it in advance, such as k-means clustering in the AI context (per @spotaws’ point about scripting languages in the software context), meaning training datasets are integral to their operation and validation. Testing datasets are often a random subset of the same source anyway, so if you can release one, you can release both. And:
  • Cross-Validation: Where training and testing datasets are selected from a superset “source” several times (e.g., k-fold cross-validation), the entire dataset of both testing and training data is required (a minimal sketch follows this list).
  • Security Testing: Relevant test cases may be necessary (but not necessarily sufficient) to ensure systems perform to spec. For example, in traditional Open Source, I can refer to the source code to verify that wordcount.exe does not also steal my crypto. For ML models, testing data can sometimes (but not always and/or not fully) play a similar role in validation and verification.
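For the cross-validation point, a minimal sketch (model_factory is a hypothetical parameter; any scikit-learn-style estimator works, and X and y are NumPy arrays):

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(model_factory, X, y, k=5):
    # Every example ends up in the test portion of exactly one fold and in
    # the training portion of the remaining k-1 folds, so "training data"
    # and "testing data" are drawn from the same superset.
    scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = model_factory()
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```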

In some cases where the training data is unavailable, providing the testing data can be used to verify and evaluate the system’s performance. In others, testing data may even be more convenient or compliant due to its smaller size (a medical model could include only patients with the required waivers, for example).

Using testing datasets in lieu of training datasets can be exploited though, for example by:

  • Size: Releasing an unusably small training set merely to comply with the definition/checklist (e.g., one cat and one dog photo). It’s hard to conceive of universal language to prevent this.
  • Backdoors/Jailbreaks: Satisfying the provided test cases while other inputs trigger hidden behaviours, bypass restrictions, release sensitive information, or deploy a hidden payload.
  • Selective Omission: Crafting a testing dataset that exhibits atypically high levels of performance (e.g., precision/recall, time) not representative of real-world scenarios, for example by filtering out examples with high MSE scores or that take longer to process.
  • Overfitting: Releasing a testing dataset similar to or a mirror of the training dataset to mislead users as to the suitability of the system (a naive overlap check is sketched after this list).
  • Oversimplification: Including only simplified examples in the testing dataset (e.g., clear, daytime images of roads without pedestrians for an autonomous vehicle system).
  • Obscuration: Providing irrelevant test data that checks the box while not giving an indication of real-world performance.
  • Versioning: Failing to update the testing data as the model evolves such that it doesn’t test newly introduced functionality.
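As one illustration, the overfitting/mirroring case above could at least be spot-checked with a naive train/test overlap check like the sketch below (exact-match hashing only, so paraphrased or near-duplicate examples would slip through):

```python
import hashlib

def fingerprint(example: str) -> str:
    # Normalise lightly, then hash; only catches exact duplicates.
    return hashlib.sha256(example.strip().lower().encode("utf-8")).hexdigest()

def overlap_fraction(train_examples, test_examples):
    # Fraction of released test examples that also appear verbatim in the
    # released training set; a high value suggests the "mirroring" exploit.
    train_hashes = {fingerprint(x) for x in train_examples}
    if not test_examples:
        return 0.0
    shared = sum(1 for x in test_examples if fingerprint(x) in train_hashes)
    return shared / len(test_examples)
```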

In any case, if any context demands testing data then every context effectively does; the definition must not discriminate against persons or fields of endeavour.

This is a very good idea, but to @Aspie96’s point, the only requirement levels that belong in a litmus test definition like the OSD or OSAID are “MUST” (or “REQUIRED” or “SHALL”) and “MUST NOT” (or “SHALL NOT”).

It could be useful for future revisions of the checklist, where “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” could impart additional meaning, but using any of these requirement levels effectively makes that term optional in the context of certification.

The word “appreciated” looked really out of place in earlier drafts!