Training data access

  1. Aren’t “privacy-preserving” techniques contradictory to open source in the first place? You might think it is safe to “open-source” a federated-learning model trained on private face data, but I guarantee you there are plenty of methods to extract private training data from the model, even if what is recovered is not byte-for-byte identical (a sketch illustrating this follows after this list).

  2. “Access to data”: anyone is able to download the original training/validation datasets anonymously, without charge, and without registration. An example is the COCO dataset: COCO - Common Objects in Context
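To make the first point concrete, here is a minimal sketch of the classic model-inversion idea: gradient ascent on the input to maximize one class score. The names `model` and `target_class` are hypothetical stand-ins for a trained face classifier and one identity it was trained on; real attacks are more sophisticated, but the principle is the same.

```python
import torch

def invert_class(model, target_class, steps=500, lr=0.1, shape=(1, 3, 112, 112)):
    """Reconstruct an input that the classifier strongly associates with one class."""
    model.eval()
    x = torch.zeros(shape, requires_grad=True)   # start from a blank image
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(x)
        # Maximize the target logit; lightly regularize toward plausible pixel values.
        loss = -logits[0, target_class] + 1e-3 * x.pow(2).sum()
        loss.backward()
        opt.step()
        x.data.clamp_(0.0, 1.0)                  # keep pixels in a valid range
    return x.detach()
```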

I concur. More generally, it is entirely possible that there exist use cases / contexts where “open source AI” is just not a good fit. That is why we should not attempt to design our definition around every possible use case, but rather derive it from first principles (like the 4 freedoms). Which is what we are doing… except when we add to the debate arguments like “what about this specific context where we cannot open up data”.


Good example. It looks like the dataset as such is licensed under CC-BY, but the individual images making up the dataset, which we have to assume are copyrighted, are not in general openly licensed or perhaps not licensed at all.


What about the massive number of existing finetunes of neural networks originally trained on data under proprietary licenses, or even on undisclosed data?
Many (rather good, I must say) models have been based on Mistral 7B. It seems to me that Mistral has encouraged improvements over the original pre-trained neural network.

You evaluate improvements over your evaluation dataset, not your training dataset.

Not unless you already include your conclusion in the “first place”.

No, that would depend on the kind of model. A model can be rather small and trained in such a way that it cannot regurgitate the original data.

“Anyone is able to download” doesn’t mean much if a piece of software that uses the model is distributed offline. If I give someone a copy of Blender on a USB stick, do I also give them the whole dataset used to train Open Image Denoise?

Encouraging improvements over a base model that is unreproducible (due to the lack of the original training data) is fine. But the fact remains that the base model is still fully controlled by its original author.

Oh come on. AI is not only about large language models. You seem to have been misled by the current LLM development trend, which has already diverged from the more rigorous academic side. Both sides have their own advantages.

If you are a researcher, you will appreciate the opportunity to reproduce others’ work independently, as well as the possibility to improve on it. Without the original training dataset, nobody other than the Mistral authors is able to reproduce the base model itself.

The requirement can be adjusted further – I agree.

We need non-language examples. I worry that people draw an equals sign between AI and large language models. AI is not only about language; it also includes computer vision and many other areas.

A direct example: you want to improve ResNet-50 for image classification. How do you make sure you can conduct controlled experiments without the ImageNet dataset? The original training dataset must be accessible for such comparisons to make sense.
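For illustration, here is a minimal sketch of such a controlled comparison, assuming torchvision 0.13+ and a local copy of the ImageNet validation split (the path /data/imagenet is hypothetical). Without access to the same data, the baseline and the modified model cannot be scored under identical conditions:

```python
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import ImageNet
from torchvision.models import resnet50, ResNet50_Weights

# Reference model and the exact preprocessing its published accuracy assumes.
weights = ResNet50_Weights.IMAGENET1K_V2
baseline = resnet50(weights=weights).eval()

# The controlled part: every candidate model is scored on the *same* validation split.
val_set = ImageNet(root="/data/imagenet", split="val", transform=weights.transforms())
val_loader = DataLoader(val_set, batch_size=64, num_workers=8)

@torch.no_grad()
def top1_accuracy(model, loader, device="cuda"):
    model = model.to(device)
    correct = total = 0
    for images, targets in loader:
        preds = model(images.to(device)).argmax(dim=1)
        correct += (preds.cpu() == targets).sum().item()
        total += targets.numel()
    return correct / total

print("baseline top-1:", top1_accuracy(baseline, val_loader))
# `my_improved_model` would be your modified ResNet-50, trained on the same data:
# print("improved top-1:", top1_accuracy(my_improved_model, val_loader))
```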

The large language model ecosystem is a mess. Don’t be misled by the mess. I suggest people here also learn a little more about AI beyond large language models.

It occurs to me that it would be quite problematic if “open access” were determined entirely by the risk tolerance of a distributor of third-party copyrighted material. One entity might release a model along with a training dataset that includes a lot of photos from Flickr, say, perhaps deciding to ignore the potential claims of authors of the photos or the licensing terms applied by those authors. Another entity might use the same training dataset and believe its use of the data in training is fair use, but might refrain from publishing the dataset because of concerns that in that context the dataset would be infringing. Why should the first entity benefit from the perception of having enabled “open access”?

The “open access” middle ground is a complicated trade-off. Requiring the data to be fully open-source (e.g. under CC-BY-SA, like a Wikipedia dump) is not practical for a large portion of the models in wide use across AI communities – it would probably make 99% of existing models non-compliant.

On the contrary, if we remove the “open access” requirement and allow model authors to hide the original training dataset, we lose the spirit and advantages of open source / free software – which would be a historical mistake.

I personally believe the solution lies somewhere near “open access”. But indeed, it is obscure in the case @fontana mentioned.

A direct example is ResNet-50 [1]. It is used by the whole computer vision community, and the pre-trained models are already widely spread across various academic and commercial projects. But ImageNet is not accessible anonymously and is restricted to academic use. People can only download it after applying on its website, or from Kaggle after signing the agreements.

Yet as a researcher, I find ResNet-50 on ImageNet very easy to reproduce, inspect, study, etc. I don’t yet have a good idea of what kind of phrasing would properly account for those complications.

[1] A pre-trained resnet can be found here: Models and pre-trained weights — Torchvision 0.17 documentation
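As a minimal sketch, the reference weights can be pulled and inspected directly through the torchvision API documented above, assuming torchvision 0.13 or newer:

```python
from torchvision.models import resnet50, ResNet50_Weights

# Download the reference pre-trained weights; `meta` records the training recipe
# and the published ImageNet-1K accuracy numbers for this checkpoint.
weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights)
print(weights.meta)
```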

To make it truly reproducible you would need to share seeds and make sure training is consistent; otherwise you wouldn’t be able to share the result of a lucky run as open source.
I haven’t seen anyone suggest that the training code must take that into account.

If I evaluate the performance of a non-AI program on some data and I don’t share that data, does that mean that that program isn’t open source?

I don’t. How would you possibly adjust it?

That level of reproducibility (byte-for-byte) is extremely difficult to achieve with CUDA, even with identical random number generators. Different generations of GPUs will give you different results due to differences in floating-point implementations and instruction sets. But the model’s performance should still be close to the reference baseline, even on different GPU models.
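For reference, here is a rough sketch of what “sharing seeds and making training consistent” usually means in PyTorch. These are standard PyTorch/CUDA switches, and even with all of them set, byte-for-byte identical results across GPU generations are not guaranteed:

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    # Pin every random number generator the training loop touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Ask cuDNN / cuBLAS for deterministic kernels where they exist.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True, warn_only=True)

seed_everything(42)
```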

The whole machine learning academic community appreciates open source, including fully open training code for reproducing the original work.

Note, you are moving the subject to “non-AI” programs. We are talking about the OSAID, so let’s focus on AI programs.

I don’t want to require the training data to be included in the media used to distribute Blender. The dataset should be separately accessible. I don’t think I specified that the training dataset has to be distributed together with the software part. Does “separately” resolve your concern about the Blender example?

Why would the standard of openness be different?
If the standard is different, then maybe it’s not “open source” that we’re looking for. Maybe there needs to be a different, higher standard which isn’t related to “open source” and maybe not even managed by OSI, which is what I already suggested.

Extending our ability to define “open source” shouldn’t shift the level of freedom required in different contexts.

Assuming “XXX applies to A” implies “XXX applies to non-A” is a logical error. I don’t think the non-AI example can characterize the AI example.

If everything that works for non-AI programs also worked for AI programs, why would we even work on the OSAID? Why wouldn’t we just use the Open Source Definition? Why are we even spending time here? It’s because AI programs are so different from non-AI programs, and that difference causes lots of trouble.

Isn’t it (extending non-AI to AI) a logical error?

Surely the meaning of “open source” in and out of AI must be as close as possible.
Otherwise what is this standard even based on? What is even the relationship between the two? Why even call two different standards with the same name?


Hmmm. That reminds me of the boundary issue between “OSD” and “OSAID”. Let me create another thread for this issue.

Here it is: Compatibility/Boundary between OSD/OSAID

Indeed, I have been wondering this. It seems to me the OSI could say the OSD applies to machine learning models, including releases of model weights, and then articulate a position on how OSD 2 (availability of source code) relates to models.


Indeed. I think a lot of the difficulty I have been having with these discussions is, at root, the fact that the OSI is proposing to reuse the term “open source” but (it seems) is at least contemplating (i) a standard of disclosure (the source code concept) that is materially higher than is assumed to apply with open source software and (possibly) (ii) a set of licensing norms that is less strict than what the OSI has applied to open source software (e.g. defining “open access” data to include data under non-libre terms or non-licensed data). If the OSI were instead to pick a somewhat new term – “open models”, “open AI systems”, preferably something without the word “open” at all – I would have much less trouble with this.


Lots to unpack here :slight_smile:

Uhm … The very first draft (v. 0.0.1, never released publicly) used the term “machine learning”, but nobody in the private working group liked it. There was strong pushback, asking to rename it Open Source AI. The main reason was that everybody was already using that term. OSI is simply using the term that is already popular.

OSI is not contemplating anything, yet. This is a public consultation; the OSI board is not formally involved. There are board members and OSI staff presiding over and facilitating the process, which is 100% community driven. Look at the composition of the working groups, for example. We’re striving for the widest possible diversity across all dimensions.

And I want to be clear: OSI is not writing this definition. Following OSI’s mission, we’re convening a conversation to find out what the communities think Open Source AI is. We noticed three years ago that we lacked a shared understanding of the AI space, and we decided to facilitate a process to learn collectively what these “new things” are.

OSI has no power to impose new terms on the industry. The term Open Source was applied to free software, overlapping 1-to-1, and came out of the gate with strong support from a large set of stakeholders. The term Open Source took off because there was a need for it and wide international support.

If there were a strong push to use another term, I’d be very happy, but we have to deal with reality. Open Source AI is the term nobody likes that everybody is already using, including legislators in the EU. We simply have to deal with it.

Let me explain what concerns me about this. Suppose a majority of the people you have classified as “stakeholders”, or maybe even just the most vocal or influential ones, decide “open source AI” means, for example, that licenses of components of an AI system can prohibit distribution. Does that mean the OSI abandons its historical principles and embraces a definition of “Open Source AI” that legitimizes restrictions on distribution? This is a serious question - I am really not sure how much energy to invest in this effort and if there is any possibility that the OSI will completely ignore what “open source” historically meant at least at a high level, I don’t want to participate. I will work on my own definition of “open source AI” instead and perhaps find some likeminded collaborators.

In this regard, I note that the OSAID drafts use language that is based on the Free Software Definition and do not refer to the OSD. That might be sensible (when I was on the OSI Board I even called for replacing the OSD with the FSD :slight_smile: ), but given the behavior in this space, I have wondered whether this gives the OSI room to reject its historical commitment against licenses with use restrictions, since one difference between the OSD and the FSD is that the OSD borrows from the DFSG in explicitly ruling out licenses that discriminate against fields of use and fields of endeavor. As interpreted by the FSF, the FSD does as well, but this is not explicit in the FSD.

And I have to ask, who is a stakeholder? I’m not used to that term being used in discussions of FLOSS-related policy. I have no particular reason to think that I am a stakeholder, despite the great importance I personally attach to this topic. It looks to me like some of the people you consider stakeholders are representatives of companies, and individual machine learning practitioners I suppose, that have been misusing or misappropriating the term “open source” in the AI context. In the non-AI software context, would the OSI consider perpetrators of “open-washing” to be stakeholders of the Open Source Definition?
