Can a derivative of non-open-source AI be considered Open Source AI?

An interesting question posted by @Danish_Contractor on the draft deserves wider views and discussion. I’m reposting it below:

Another question that is not sufficiently addressed – if someone instruction fine-tunes a base model (no training data disclosed) but discloses sources of data for instruction-tuning, will that model ever be considered an open-source AI model?

The answer, I think, is no, but it should be discussed in the context of this proposed standard.

This is a very interesting question and one that I think can lead to a lot of bikeshedding if we discuss it in the abstract. I would have assumed that fine-tuning a model without what draft 0.0.8 calls Data information is not ideal.

I’d like to reason around specific examples where something like this has happened…
To address the concern raised by Danish, I’d like to hear from experts:

  1. To what extent is it technically possible to fine-tune a model without knowing anything about how the system was initially trained? If that’s totally possible, then
  2. Can we look at an existing AI/ML system where little is known about the data and techniques used for training, but that has been fine-tuned successfully with full disclosure of the data and techniques used to fine-tune? Hopefully that will help us address the issue concretely.

There are plenty of working Mistral finetunes, despite the fact that not very much is known about the training of Mistral.
Additionally, a finetune can be very slight, injecting only a small amount of information into the model.

Regarding point 1, I’m not sure to what extent fine-tuning is possible. However, I do believe a considerable amount of adjustment is possible. It would be better to ask AI vendors about this.

My understanding is that in fine-tuning, the weights of the newly added trainable layers are updated along with the weights of some layers of the pre-trained model; not all layers of the pre-trained model are updated. If all layers were updated, it would negate the purpose of having a pre-trained model in the first place.
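As a hedged illustration of the strategy described above (a new head plus a subset of the pre-trained layers being updated), here is a minimal PyTorch/Transformers sketch; distilgpt2 and the layer names are example choices, not anything specific to this thread:

```python
import torch
from transformers import AutoModelForSequenceClassification

# Pre-trained body plus a newly added, randomly initialized classification
# head ("score"); distilgpt2 is used purely as a small example model.
model = AutoModelForSequenceClassification.from_pretrained("distilgpt2", num_labels=2)

# Update the new head and only the last pre-trained block; freeze the rest.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("score", "transformer.h.5"))

# The optimizer only sees the parameters that were left trainable.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```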

I’m not sure about this.
Multiple strategies for fine-tuning exist, and anyone is free to make up their own.

You do not need to add any new layers; you can update weights in any number of existing layers.

You don’t have to update the whole network, but if you choose to, that doesn’t negate the purpose of having a pre-trained model.
Training a new model from scratch can require a lot of data and time (depending on the size of the model, the training algorithm and other factors). Fine-tuning might be cheaper, even if you train all layers, because you are leveraging the information the model already contains.

In essence, you can use fine-tuning to slightly bias the model so that it performs better on a certain class of samples, or to slightly change its general behavior.
If you are training all layers, it’s similar to having seen that new data at the end of the overall training process, so more information about it is preserved.
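To make the “train all layers” alternative concrete, here is an equally minimal sketch (again with distilgpt2 standing in for any pre-trained model): nothing is frozen and no layers are added, and a small learning rate keeps the updates gentle so most of what the model already learned is preserved.

```python
import torch
from transformers import AutoModelForCausalLM

# Full fine-tune: no new layers are added and nothing is frozen.
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Every weight in the network can be updated; the small learning rate
# keeps the updates gentle, so the fine-tune biases the model rather
# than overwriting what it learned during pre-training.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
```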

One thing I’d like to point out, which doesn’t really relate to @Danish_Contractor’s quote but does relate to the title of this thread, is that if we assume trained models are copyrightable works of authorship, the answer to the question is obviously “no” whenever licensing is the reason the original model is not open source.

It’s not an assumption that I think is at all reasonable (not under the law as is, nor under the law as I wish it to be), but it’s one we should probably go by when handling licensing.

I asked a machine learning expert about fine-tuning and understood that there are various approaches to it. It can add new layers or update existing ones. However, when updating existing layers, it seems to me that it relies on the information in the original layers. In Japan, it is certain that this does not constitute a derivative work under intellectual property law, but I am not sure about other jurisdictions.

Yes, exactly, otherwise it wouldn’t be a finetune (or, at least, there would be no point to finetuning).

You may be finetuning only a few of the layers (even just one) or the whole network (potentially modifying every weight), but a lot of the information the network learned is preserved. Some of it may also get lost in the process.

The finetune can be extremely slight. It may modify the model only in a very marginal way.
If that’s enough to create a model that is independent, copyright-wise, from the original, then licensing of models doesn’t matter at all. We could just apply a tiny bit of fine-tuning to a proprietary model.
If what you are implying, however, is that ML models in Japan already don’t have any copyright on them (so modifying them wouldn’t be a derivative work, because it’s not a “work” to begin with), then copyright licenses are also irrelevant (but I guess contracts may be relevant).

Yes, in Japan, it has been concluded that there is no copyright (including database copyright) for model parameters and weights. The only way to protect them is through contracts.

To clarify and prevent any misunderstandings:

In Japan, there has been no definitive conclusion on whether common open source licenses are considered contracts. About 20 years ago, a government agency issued an ambiguous conclusion that “the GPL may contain contractual elements,” and since then, it has become customary in the software business world to treat open source licenses as contracts.

Even if AI model parameters do not have copyright protection, the application of an OSI-approved license suggests the possibility of contractual protection. In practice, I believe the overwhelming majority of businesspeople currently interpret it this way.

Somewhat late to this thread, but let me add specific examples in response to Stefano’s questions:

  1. To what extent is it technically possible to fine-tune a model without knowing anything about how the system was initially trained? If that’s totally possible, then
  2. Can we look at an existing AI/ML system where little is known about the data and techniques used for training, but that has been fine-tuned successfully with full disclosure of the data and techniques used to fine-tune? Hopefully that will help us address the issue concretely.

There are plenty of Meta and Mistral fine-tunes that work well (at least by some measures) and are fairly open about the fine-tuning data and procedures, as @Aspie96 also notes. Yet they will always be hampered by the non-disclosure of the training data (and sometimes the fine-tuning data) of the initial Meta and Mistral systems. Our overview has about 20 such ‘clones’, where the LLM base is a closed (at best open-weights) model like Llama 2/3 or Mistral 7B.

A pretty good recent example is Intel’s Neural 7B: they take Mistral 7B and fine-tune it using Orca data. Their github repo is quite helpful about the fine-tuning steps and about the data curation for fine-tuning.
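For a rough idea of what such a fine-tune involves, here is a hedged sketch of supervised fine-tuning of Mistral 7B on an Orca-style instruction dataset with Hugging Face Transformers. Open-Orca/OpenOrca stands in for “Orca data”, the prompt formatting is deliberately simplified, and this is not Intel’s actual recipe (their repo documents the real steps); it is only meant to show that the fine-tuning side can be fully disclosed while the base model’s pre-training data remains unknown.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Disclosed ingredients: a public base model and a public instruction dataset.
base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)  # needs a large GPU in practice

# Small slice of an Orca-style dataset, just to keep the sketch cheap.
dataset = load_dataset("Open-Orca/OpenOrca", split="train[:1%]")

def to_text(example):
    # Concatenate prompt and response into one training string (simplified).
    return {"text": f"{example['question']}\n{example['response']}"}

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(to_text).map(
    tokenize, batched=True, remove_columns=dataset.column_names + ["text"]
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mistral-orca-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```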

Another good example is AllenAI’s Tulu 70B, which takes Llama 2 70B and fine-tunes it using the Tulu SFT and UltraFeedback datasets. Their OpenInstruct GitHub repo is a model of clarity.

However, since both models opt to work with notoriously closed base models (Mistral 7B and Llama 2 70B respectively), I don’t think such efforts could ever qualify as truly open generative AI systems. This would be my answer to @Danish_Contractor’s original query. For instance, downstream users of these models cannot exclude possible copyright violations in the original pre-training data, which may create legal liabilities (of the kind shown in the NYT <> OpenAI lawsuits), and end users will never be able to open up the black box of the initial pre-trained model to better understand its behaviour.


I think this issue is clearly described, thanks everyone for sharing your knowledge: the lack of transparency about the data used in the original model cannot be compensated by transparency downstream. Derivatives of non-Open Source AI cannot be Open Source AI.

I don’t think we should be concerned about theoretical copyright violations when there are very concrete issues with lack of transparency on data.

The copyright issues may not exist in the US, and may not exist in other parts of the world, like Japan.

On the other hand, it’s very likely that opaquely trained models contain privacy issues, biases, harmful content, and possibly backdoors and other security issues. Tactically, I think it’s a bad move to lead with copyright issues; it’s like playing with fire, as expanding the scope of copyright is very likely to backfire on the open movement.


I will talk about my experience fine-tuning a “small” model (200M params) to build a feature in Firefox that describes images (Experimenting with local alt text generation in Firefox Nightly - Mozilla Hacks - the Web developer blog)

For image-to-text models like that, it’s standard practice to use a base model for both the image encoding and the text decoding, and fine-tune them, so we don’t start from scratch (a sketch of this pairing follows the list below).

I picked:

  • a ViT model for the image encoder: google/vit-base-patch16-224-in21k
  • a Distilled GPT-2 model for the text decoder: distilbert/distilgpt2
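For reference, this encoder-decoder pairing is straightforward to express with Hugging Face Transformers. The sketch below shows the typical wiring; it is not necessarily the exact code used in mozilla/distilvit.

```python
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

# Pair the two pre-trained checkpoints into one image-captioning model;
# both encoder and decoder weights become the starting point for fine-tuning.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "distilbert/distilgpt2"
)

# The image processor turns images into pixel inputs for the ViT encoder;
# the tokenizer handles the captions on the GPT-2 side.
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")

# GPT-2 has no pad token; caption generation also needs explicit start/pad ids.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```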

One issue we had when we trained on COCO and Flickr30k was annotator bias.
I worked on de-biasing the dataset (see distilvit/docs/fighting_bias.md at main · mozilla/distilvit · GitHub) and retrained it.

For example, one thing we changed was to keep the descriptions gender-neutral. That new round of fine-tuning was very effective: with only 30k labels, we have yet to find any case of a gendered description, and we have manually tested thousands of images, including adversarial ones.
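To give a flavour of what keeping descriptions gender-neutral can mean at the data level, here is a deliberately simplistic sketch of rewriting gendered caption words before retraining; the actual distilvit pipeline is documented in the fighting_bias.md doc linked above and is more sophisticated than this.

```python
import re

# Minimal, assumed word mapping for illustration only; not the project's list.
NEUTRAL = {
    "man": "person", "woman": "person",
    "men": "people", "women": "people",
    "boy": "child", "girl": "child",
}
PATTERN = re.compile(r"\b(" + "|".join(NEUTRAL) + r")\b", re.IGNORECASE)

def neutralize(caption: str) -> str:
    """Replace gendered words in a caption with neutral ones, keeping case."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        repl = NEUTRAL[word.lower()]
        return repl.capitalize() if word[0].isupper() else repl
    return PATTERN.sub(swap, caption)

print(neutralize("A woman holding a boy's hand on the beach."))
# -> "A person holding a child's hand on the beach."
```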

What I am getting at is that if a project is very intentional in how it fine-tunes base models, even if it doesn’t have 100% traceability of the data used downstream, and if the whole process is transparent and published as open source, I would consider the new fine-tuned model as open source, and anything that would surface from the downstream data as a fixable issue for the project maintainers.

Cheers
Tarek
