Open Source AI needs to require data to be viable

Hey everyone,
I’ve really enjoyed reading through this insightful thread, and I’d like to share my thoughts.

I’m leaning towards the data-information approach, which leads me to this key question:

What would be the preferred form for making modifications to the system?

Ideally, as a data scientist, I'd say having the dataset available is essential: we need to understand the data before ingesting it into our models, and to know how to preprocess it so the model learns the behaviour we want. However, I recognize that the situation might differ for LLMs.
Arguments for data availability often cite explainability, security, and transparency. Yet these concerns aren't necessarily resolved just by sharing the data; they are meant to be addressed through proper safeguards and other methodologies, as https://arxiv.org/pdf/2308.13387 discusses, especially given that the datasets involved are extremely large and may contain copyrighted material. Another point is that studies show we can unlearn specific segments from an LLM without access to the initial data: https://arxiv.org/pdf/2310.02238. More importantly, the LLM world is gaining popularity in techniques to augment or contextualize a model's knowledge, like fine-tuning, RAG, RAFT, and many other approaches that benefit from access to the model rather than to the initial dataset (see the sketch below).
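
To make that last point concrete, here is a minimal sketch of adapting a released open-weights model with our *own* data, no original training set required. I'm assuming the Hugging Face Transformers/Datasets APIs; "gpt2" and the toy Q&A texts are just placeholders, not anything discussed in this thread:

```python
# Minimal fine-tuning sketch: adapt released weights using only our own data.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # placeholder open-weights checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Our *own* domain data, completely independent of the original training set.
own_texts = [
    "Q: How many reviewers does a release need?\nA: Two.",
    "Q: Where are audit logs stored?\nA: In the compliance bucket.",
]
dataset = Dataset.from_dict({"text": own_texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="adapted-model",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # modifies the system using only the weights + our new data
```

This is only an illustration, of course, but it shows why access to the architecture and weights is what actually unlocks modification in practice.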

Given the architecture of LLMs (transformers), even a perfectly curated dataset does not guarantee the elimination of issues like hallucinations or undesired outcomes.

Therefore, I believe that sharing the dataset is not necessarily required and may not justify the potential risks associated with making it mandatory, as @stefano mentioned. Importantly, this approach wouldn't jeopardize the open-source spirit or block any future development. By using alternative methods and maintaining transparency about how models are built and trained, we can uphold open-source principles without compromising data security or integrity.

Can having information about the initial dataset be useful? Absolutely! Do I need direct access to it? Ideally, yes, but in practice, it can be avoided.

In conclusion, in my opinion, having the entire dataset is not always the preferred form for making modifications to the system, whereas I do think sharing the architecture and the weights is mandatory. Study and use can often be accomplished through indirect methods like model introspection and interpretability tools. Modifying and sharing outputs can be managed effectively with techniques such as fine-tuning and Retrieval-Augmented Generation (RAG) without the original dataset (we still need information about the data to decide whether we should "inject" external data the way RAG or fine-tuning methods do); see the small RAG sketch below.
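
Here is a minimal RAG-style sketch of what I mean by "injecting" external data at query time. I'm assuming scikit-learn for a simple TF-IDF retriever; the documents and question are made-up placeholders, and any open-weights model (or hosted API) could consume the resulting prompt:

```python
# Minimal RAG sketch: retrieve our own external knowledge and inject it into
# the prompt, instead of needing the model's original training dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Our internal policy requires two reviewers per release.",
    "The 2023 audit introduced a new data-retention rule.",
]
question = "How many reviewers does a release need?"

# Retrieve the most relevant document with plain TF-IDF similarity.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([question])
best_doc = documents[cosine_similarity(query_vector, doc_vectors).argmax()]

# Prepend the retrieved context to the prompt; no original dataset involved.
prompt = f"Context: {best_doc}\n\nQuestion: {question}\nAnswer:"
print(prompt)  # pass `prompt` to the LLM of your choice
```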

Ultimately, while the dataset can enhance these freedoms, especially for transparency and reproducibility, the evolving landscape of machine learning shows that we can often achieve the same goals through alternative methods without compromising security or intellectual property. Thus, in my opinion, it's feasible to keep the open-source spirit without needing the complete initial dataset, particularly for Large Language Models (LLMs).