Sharing this article from Anastassia Kornilova on LinkedIn:
What does Phi-4 tell us about Open Source licenses? Microsoft’s Phi-4 was released under the permissive MIT license (making it Open Source Software), but it is not necessarily “Open Source AI” (OSAI).
The MIT license allows the code and the weights of the Phi-4 models to be used or modified for any purpose. In practical terms, this means that:
- Phi-4 can be used in a commercial product with no restrictions
- Phi-4 or content generated with it can be used to train other models (OpenAI and Anthropic have restrictions on this type of use)
- Phi-4 can be fine-tuned without attribution (in contrast, derivatives of Llama-3 models must use names that begin with Llama-3)
And yet… Phi-4 falls short when it comes to releasing information about the training data. Releasing the full training data is not considered necessary under many definitions of OSAI, but Phi-4 does not meet even these more relaxed standards. While the Technical Report does provide many interesting insights into the process, major gaps include:
- Defining “high-quality educational content”: A central tenet of the Phi-4 training philosophy is that a smaller amount of ‘good’ data can beat large web dumps, yet the report leaves several gaps in defining such content. (OSI’s definition of OSAI requires that the code used to create and process the training data be shared; it is not public.)
- Post-training data: Phi-4 models were post-trained using DPO for alignment with human preferences and to minimize unwanted content (e.g. hate speech). Only limited details are available about these datasets, despite their importance to the final model’s behavior.
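For context on why those datasets matter so much: DPO trains directly on human preference pairs. For a prompt x with a preferred completion y_w and a rejected completion y_l, the standard DPO objective (the report may use a variant, but this is the canonical form) is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\,\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]
```

Because the loss is driven entirely by which completion was labeled preferred in each pair, the composition of the preference dataset D directly shapes the final model’s behavior, so leaving it undocumented is a meaningful transparency gap.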
Some of these gaps could be addressed by reporting on model limitations and red-teaming results. The Technical Report summarizes these findings, but I think that section requires more depth. So… why does transparency matter?
- Regulatory requirements: Open-source models may be governed differently under various laws. The EU AI Act’s definition of open source focuses on the final model, not the data, but other laws may approach this differently.
- Safety: In general, models can be jailbroken; understanding what goes into the training data is the most reliable way to understand what the model can output.
- Fairness: Phi-4 focuses on “high-quality educational content” - there is nuance to this concept that is worth exploring.
The MIT license is very permissive and a valuable contribution to the research community, and Phi-4 may also be a safe and cheap solution for many applications. This post highlights how Phi-4 illustrates the differences between open source software and Open Source AI.