The recently released DeepSeek-R1 (https://api-docs.deepseek.com/news/news250120) is being trumpeted all over as being “more” open source than others, and it’s true that it uses the MIT license for code and weights, but I’ve not seen anything clear about the training data yet.
Perhaps we can share relevant info here about this model?
I haven’t investigated DeepSeek at all yet, but based on the rumors I’ve heard, I suspect there might be an issue with the data rather than the code. I believe they are skillfully making use of synthetic data, but I’m concerned about its source. If there are any contractual restrictions on how the synthetic data can be used, it could introduce legal uncertainty for the resulting model. If you happen to have detailed information about DeepSeek’s training data, I would very much appreciate it if you could share.
You are right about the lack of data information for DeepSeek; disclosure of that information is a requirement of the OSAID.
Also, the OSAID explicitly states that the:
“code shall represent the full specification of how the data was processed and filtered, and how the training was done. Code shall be made available under OSI-approved licenses.”
From what I could find, although the DeepSeek code is licensed under MIT, the particular code used to process and filter the data and to run training is missing. This is where the “secret sauce” is.
Their paper provides some clues as to their approach:
Nathan Lambert from AI2 provides a good analysis of the process:
But again, both 1) the data information and 2) the code to process and filter the data and to run training are missing, thus failing to meet the OSAID.
Our science team has started working on fully reproducing and open-sourcing R1, including training data, training scripts,…
The full power of open source AI, so that everyone all over the world can take advantage of AI progress! It will help debunk some myths too, I’m sure.
I think it’s worth watching this repo and participating:
Parts of DeepSeek appear to be OSS while other parts are not. For example, in Deepseek-LLM, there’s this Model License which is not OSD-compliant due to a long list of domain restrictions (and possibly other issues, I haven’t reviewed it in depth).
Deepseek seems to be drawing a distinction with licenses between R-1, v2, and v3. V2 and V3 clearly have the license mentioned (Deepseek License Agreement), distinguishing that the code is under the MIT License and the model is under the Deepseek License Agreement. R-1 seems to only reference MIT from what I see; this doesn’t mean it meets the Open Source AI Definition - just pointing out the difference. Also note that there is good info here: DeepSeek-R1 Release | DeepSeek API Docs. On the left, you can pull up specific version info. For example, for R-1 you’ll see this statement: “DeepSeek-R1 is now MIT licensed for clear open access.” For v3, you’ll see this: “Open-source spirit + Longtermism to inclusive AGI.”
It’s really hard for me to tell because I don’t have a clear picture of what the dependencies are between the various Deepseek repos. For example, is R1 completely independent of Deepseek-LLM, or does it incorporate it?
DeepSeek has built a test of meaningful open-sourcishness into R1. The model, as released, censors specific content according to Chinese political policy. Can folks actually alter (and redistribute) a version without that censorship?
Japan’s CyberAgent, Inc. has already published a model in which they performed reinforcement learning on DeepSeek R1’s Qwen-distilled model using a Japanese dataset. This model apparently has the capacity to describe, in detail, events that would typically be censored in China (such as the Tiananmen Square incident), and it appears capable of switching from Chinese to Japanese reasoning. In that sense, it seems feasible to refine the model in ways people desire.
That said, if DeepSeek R1 itself uses output data from a model subject to restrictions under a different contract, releasing DeepSeek R1 might no longer be permissible. It remains unclear whether any of the major U.S. AI companies will investigate the matter to uncover such details.
More information about Hugging Face’s effort to open source DeepSeek has been published here:
Technically, R1 is “open” in that the model is permissively licensed, which means it can be deployed largely without restrictions. However, R1 isn’t “open source” by the widely accepted definition because some of the tools used to build it are shrouded in mystery. Like many high-flying AI companies, DeepSeek is loath to reveal its secret sauce.
That’s for a different model: DeepSeek V3. See the discussion above about how R1 is different, and has an MIT license on the R1 model.
As noted in the link @moodler posted above, DeepSeek’s github for R1 says:
This code repository and the model weights are licensed under the MIT License.
I wish I had 5 hours to watch this podcast from Lex Fridman with Nathan Lambert (from AI2) about DeepSeek, Tulu, open weights, and Open Source AI. But I recommend watching the first 12 minutes and the 10 minutes starting at 4:37:49.