What requirements does data information place on derived works? This question was briefly brought up in a previous topic, but I am failing to find the post and apparently didn’t bookmark it.
Specifically, imagine that I fine-tune Mistral 7B to build a system that is good at helping guide discussions in the open source community.
I release the training data where I can, and pointers to the data where I am concerned about legal implications.
I release the code I used for training, and the code for my resulting software system, say under the Apache 2.0 license.
I argue that I have given a preferred form of modification and met the conditions of draft 0.0.8’s text. (I have not looked at the checklist).
You may not be able to fully understand my base model, but if there is a model you like better, you can substitute that model and apply my training data to that model.
For a system like this, substituting the base model is a more natural approach to modification and understanding than digging into the base model.
I’ve certainly started AI projects and tried applying fine-tuning data to multiple base models, so that sort of substitution definitely works.
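To make the substitution concrete, here is a minimal sketch of the idea. The function name, model identifiers, and hyperparameters are illustrative assumptions, not part of any real release: the point is just that the released training data and recipe stay fixed while the base model is a single swappable field.

```python
# Hypothetical sketch: the fine-tuning recipe is independent of the base
# model, so substituting a different checkpoint is a one-field change.
# build_finetune_config and all values here are illustrative, not a real API.

def build_finetune_config(base_model: str, data_path: str = "training_data.jsonl") -> dict:
    """Return a fine-tuning recipe; only `base_model` varies between runs."""
    return {
        "base_model": base_model,   # the only field that changes on substitution
        "train_file": data_path,    # the released training data stays the same
        "learning_rate": 2e-5,
        "num_epochs": 3,
    }

# Same released data, two different base models:
mistral_run = build_finetune_config("mistralai/Mistral-7B-v0.1")
llama_run = build_finetune_config("meta-llama/Llama-2-7b-hf")

# Everything except the base model is identical between the two runs.
assert mistral_run["train_file"] == llama_run["train_file"]
```

In other words, if you prefer a different base model, the derived work's data and code let you rerun the same recipe against it, which is the mode of modification the argument above relies on.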
I think that if I used Llama2 and flowed down its use restrictions as required by its license, the result would not be open, because it would not guarantee the freedom to use.
At least in the case where the derived work is released by someone other than Mistral (someone who is not in a position to cure the ways in which Mistral is not open), I argue that this derived work should be open.
Let’s take a look at the ways in which Mistral is not open:
- Some of its code is not available under an open license. I think that’s a big deal in looking at whether a sufficiently skilled person could reproduce something substantially similar to Mistral. However, note that there is plenty of code under open licenses to train and infer with the Mistral weights in multiple forms. The issue with the specific code not being available is not that you can’t train a Mistral model, simply that your results might differ enough that you could not produce something substantially similar.
- Data information. I think the big question is whether there is a data information break between a derived work and a base model. I argue that for a derived work, the answer should be that providing a pointer to a base model and enough specificity to retrain the derived work should be enough to meet the data information requirement.