Explaining the concept of Data information

I read one of the previous threads.

I still think there’s a fairly deep divide about what the preferred form of modification is. Unfortunately, even more than with traditional software, it depends on what you want to do.
Some contributions to the discussion talked about fine-tuning, transfer learning, and other techniques for which the model weights, rather than the data, are effectively the preferred form of modification.

As a software developer, retraining something like Mistral 7B is entirely outside my budget and current capabilities.
If I want to exercise my freedom to use, to modify, or to share an LLM, the model weights are the form I need and want.
Starting from the released weights and applying techniques like fine-tuning is how I would go about modifying a model.
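
To make that concrete, here is roughly what weight-based modification looks like for someone in my position. This is a minimal sketch assuming the Hugging Face `transformers`, `peft`, and `datasets` libraries; the checkpoint name and training file are illustrative placeholders, not a recommendation. The point is that it needs only the released weights, never the original training data:

```python
# Minimal LoRA fine-tuning sketch: modify a model starting from its released
# weights alone. Checkpoint and data file below are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "mistralai/Mistral-7B-v0.1"  # any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Freeze the base weights and train only small low-rank adapters; this is
# what keeps the cost within a hobbyist budget, unlike retraining from scratch.
model = get_peft_model(
    model,
    LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    ),
)

# Any small local text corpus stands in for "my own data" here.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mistral-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        logging_steps=10,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("mistral-lora")  # the adapter alone can be shared
```

Nothing in that loop touches the original training set, which is why, for this use case, the weights are the preferred form of modification.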

On the other hand, if I want to exercise my freedom to understand a model, then I start to care about the training data. In many ways, data information is going to help me more with that understanding than the raw data: I will probably gain more from a description of the data than from the raw data itself.

I’d like to explore the possibility that a lot of this disagreement boils down to how we rank each of the freedoms.
For myself, I rank the freedom to use and modify most highly. I care most that high-quality LLMs with open components be available without use restrictions. So I would prefer to relax the requirements on data even beyond what is in draft 0.0.8. My fear is that if no high-quality AI models meet the definition, it will not have enough initial relevance to have market impact.

On the other hand, if you are a researcher focused on understanding as the primary freedom, then getting as strong a requirement for data openness as you can might be important.
