Can a derivative of non-open-source AI be considered Open Source AI?

An interesting question posted by @Danish_Contractor on the draft deserves wider views and discussion. I’m reposting it below:

Another question that is not sufficiently addressed: if someone instruction fine-tunes a base model (no training data disclosed) but discloses the sources of the data used for instruction-tuning, will that model ever be considered an open-source AI model?

The answer, I think, is no, but it should be discussed in the context of this proposed standard.

This is a very interesting question, and one that I think can lead to a lot of bikeshedding if we discuss it in the abstract. I would have assumed that fine-tuning a model without what draft 0.0.8 calls Data information is not ideal.

I’d like to reason about specific examples where something like this has happened. To address the concern raised by Danish, I’d like to hear from experts:

  1. To what extent is it technically possible to fine-tune a model without knowing anything about how the system was initially trained? If that’s entirely possible, then
  2. Can we look at an existing AI/ML system where little is known about the data and techniques used for training, but that has been fine-tuned successfully with full disclosure of the data and techniques used to fine-tune it? Hopefully that will help us address the issue concretely.

There are plenty of working Mistral finetunes, despite the fact that not much is known about the training of Mistral.
Additionally, a finetune can be very slight, injecting only a small amount of information into the model.
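
As a hypothetical sketch of how slight such a finetune can be, here is what a LoRA-style fine-tune of a Mistral base model might look like with the Hugging Face transformers and peft libraries (the model ID and hyperparameters are illustrative): no knowledge of the base model’s training data is needed, and only the small adapter matrices are trained.

```python
# Hypothetical sketch: a "slight" fine-tune via LoRA adapters.
# Only the low-rank adapter weights are trained; the base weights are untouched.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

config = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections only
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```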

Regarding point 1, I’m not sure to what extent fine-tuning is possible without that knowledge. However, I do believe a considerable amount of adjustment is possible. It would be better to ask AI vendors about this.

My understanding is that in fine-tuning, the weights of the newly added trainable layers are updated along with the weights of some layers of the pre-trained model. However, not all layers of the pre-trained model are updated; if all layers were updated, it would negate the purpose of having a pre-trained model in the first place.
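
For illustration, here is a minimal PyTorch sketch of the strategy described above, using a torchvision ResNet purely as a stand-in base model (the model, layer names, and class count are illustrative):

```python
# Illustrative sketch: add a new head and update it together with only
# some layers of the pre-trained model.
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1")   # a pre-trained base model

for param in model.parameters():
    param.requires_grad = False             # freeze all pre-trained weights

for param in model.layer4.parameters():
    param.requires_grad = True              # ...but update the last pre-trained block

model.fc = nn.Linear(model.fc.in_features, 10)  # new head, trainable by default
```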

I’m not sure about this.
Multiple strategies for fine-tuning exist, and anyone is free to make up their own.

You do not need to add any new layers; you can update the weights in any number of existing layers.

You don’t have to update the whole network, but if you choose to, that doesn’t negate the purpose of having a pre-trained model.
Training a new model from scratch can require a lot of data and time (depending on the size of the model, the training algorithm, and other factors). Fine-tuning might be cheaper, even if you train all layers, because you are leveraging the information the model already contains.

In essence, you can use fine-tuning to slightly bias the model so that it performs better on a certain class of samples, or to slightly change its general behavior.
If you are training all layers, it’s similar to having had that new data at the end of the overall training process, so that more information about it is preserved.
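
To make the points above concrete, here is a minimal sketch (using GPT-2 from Hugging Face transformers purely as an example model) showing that no new layers are required and that any subset of the existing weights, up to all of them, can be made trainable:

```python
# Illustrative sketch: choose which existing weights a fine-tune updates.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Full fine-tune: do nothing -- every existing weight is trainable by default.

# Partial fine-tune: freeze everything, then re-enable any subset you like.
for param in model.parameters():
    param.requires_grad = False
for block in model.transformer.h[-2:]:              # last two transformer blocks
    for param in block.parameters():
        param.requires_grad = True
for param in model.transformer.ln_f.parameters():   # final layer norm
    param.requires_grad = True
```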

One thing I’d like to point out, which doesn’t really relate to @Danish_Contractor’s quote but does relate to the title of this thread: if we assume that trained models are copyrightable works of authorship, then the answer to the question is obviously “no” whenever licensing is the reason the base model is not open source.

It’s not an assumption I think is at all reasonable (not under the law as it is, nor under the law as I wish it to be), but it’s one we should probably go by when handling licensing.