DeepSeek-R1: does it conform to OSAID?

The recently released DeepSeek-R1 (https://api-docs.deepseek.com/news/news250120) is being trumpeted all over as being “more” open source than others, and it’s true that it uses the MIT license for code and weights, but I’ve not seen anything clear about the training data yet.

Perhaps we can share relevant info here about this model?


Where do I actually get the code? I’m not seeing any links to downloads or a GH repo anywhere.

Start here I guess:

I haven’t investigated DeepSeek at all yet, but based on the rumors I’ve heard, I suspect there might be an issue with the data rather than the code. I believe they are skillfully making use of synthetic data, but I’m concerned about its source. If there are any contractual restrictions on how the synthetic data can be used, it could introduce legal uncertainty for the resulting model. If you happen to have detailed information about DeepSeek’s training data, I would very much appreciate it if you could share.


You are right about the lack of data information for DeepSeek; that information is a requirement of the OSAID.

Also, the OSAID explicitly states that the:

code shall represent the full specification of how the data was processed and filtered, and how the training was done. Code shall be made available under OSI-approved licenses.

From what I could find, although the DeepSeek code is licensed under MIT, the particular code used to process and filter the data and to train the model is missing. This is where the “secret sauce” is.

Their paper provides some clues as to their approach:

Nathan Lambert from AI2 provides a good analysis of the process:

But again, both the 1) data information and the 2) code to process and filter the data and train the model are missing, thus failing to meet the OSAID.
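To make the gap concrete, here is a toy sketch of that component checklist. The component names and the per-model flags are illustrative only, reflecting this thread's assessment of what DeepSeek has published so far; this is not an official OSAID conformance tool.

```python
# Illustrative OSAID-style component checklist (names are my own shorthand,
# not official OSAID terminology).
OSAID_REQUIRED = [
    "data_information",      # sufficiently detailed description of training data
    "data_processing_code",  # code to process and filter the data
    "training_code",         # code used to run the training
    "model_weights",         # parameters available under OSI-approved terms
]

def osaid_gaps(release: dict) -> list:
    """Return the required components a release is missing."""
    return [c for c in OSAID_REQUIRED if not release.get(c, False)]

# DeepSeek-R1 as discussed above: MIT-licensed code and weights,
# but no data information and no processing/training code released.
deepseek_r1 = {
    "data_information": False,
    "data_processing_code": False,
    "training_code": False,
    "model_weights": True,
}

print(osaid_gaps(deepseek_r1))
# -> ['data_information', 'data_processing_code', 'training_code']
```

Under those (assumed) flags, the checklist reports exactly the two gaps described above: the data information and the processing/training code.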


From Clem Delangue (CEO at Hugging Face):

Our science team has started working on fully reproducing and open-sourcing R1 including training data, training scripts,…
Full power of open source AI so that everyone all over the world can take advantage of AI progress! Will help debunk some myths I’m sure too.

I think it’s worth watching this repo and participating:

Parts of DeepSeek appear to be OSS while other parts are not. For example, in Deepseek-LLM, there’s this Model License which is not OSD-compliant due to a long list of domain restrictions (and possibly other issues, I haven’t reviewed it in depth).

DeepSeek seems to be drawing a licensing distinction between R1, V2, and V3. V2 and V3 clearly reference the DeepSeek License Agreement, specifying that the code is under MIT and the model is under the DeepSeek License Agreement. R1 seems to reference only MIT, from what I see; this doesn’t mean it meets the Open Source AI Definition, I’m just pointing out the difference. Also note that there is good info here: DeepSeek-R1 Release | DeepSeek API Docs. On the left, you can pull up version-specific info. For example, for R1 you’ll see this statement: “DeepSeek-R1 is now MIT licensed for clear open access.” For V3, you’ll see: “Open-source spirit + Longtermism to inclusive AGI.”


It’s really hard for me to tell because I don’t have a clear picture of what the dependencies are between the various DeepSeek repos. Like, is R1 completely independent of Deepseek-LLM, or does it incorporate it?

In terms of the model, there is a bit more clarity in their Hugging Face entry: deepseek-ai/DeepSeek-R1 · Hugging Face

This code repository and the model weights are licensed under the MIT License.

But I agree that they need to also specify how the data was processed and filtered, and how the training was done, etc.

All,

DeepSeek has built a test of meaningful open-sourcishness into R1. The model, as released, censors specific content in line with Chinese political policy. Can folks actually alter (and redistribute) a version without that censorship?

Japan’s CyberAgent, Inc. has already published a model in which they performed reinforcement learning on DeepSeek R1’s Qwen-distilled model using a Japanese dataset. This model apparently has the capacity to describe, in detail, events that would typically be censored in China (such as the Tiananmen Square incident), and it appears capable of switching from Chinese to Japanese reasoning. In that sense, it seems feasible to refine the model in ways people desire.

That said, if DeepSeek R1 itself uses output data from a model subject to restrictions under a different contract, releasing DeepSeek R1 might no longer be permissible. It remains unclear whether any of the major U.S. AI companies will investigate the matter to uncover such details.


More information about Hugging Face’s effort to open source DeepSeek has been published here:

Technically, R1 is “open” in that the model is permissively licensed, which means it can be deployed largely without restrictions. However, R1 isn’t “open source” by the widely accepted definition because some of the tools used to build it are shrouded in mystery. Like many high-flying AI companies, DeepSeek is loath to reveal its secret sauce.


The model is licensed under an Open-RAIL-M License (without that naming convention) DeepSeek-V3/LICENSE-MODEL at main · deepseek-ai/DeepSeek-V3 · GitHub. This is similar to how the BigScience BLOOM model and the models from BigCode were licensed.

That’s for a different model: DeepSeek V3. See the discussion above about how R1 is different, and has an MIT license on the R1 model.
As noted in the link @moodler posted above, DeepSeek’s GitHub repo for R1 says:

This code repository and the model weights are licensed under the MIT License.

Just a curiosity: according to the Model Openness Framework from the Linux Foundation, DeepSeek-R1 is classified as an Open Model:

https://mot.isitopen.ai/model/1143


Ahh - thanks for noting that alternative classification approach. Note that the MOF “Open Model” is the “Class III” level of their framework. Their closest parallel to the OSAID is their Class I “Open Science” level. It requires detailed documentation of the data, training, and model, but as I see it, it doesn’t require the release of the actual code like the OSAID does, just thorough documentation, à la a traditional scientific paper.
Introducing the Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency and Usability in AI – LFAI & Data

@stefano provided a comparison of the OSAID and the MOF. The OSAID requires components from all three MOF classes, but not all of them:

Feel free to comment on that topic if you have any questions about MOF.


I wish I had 5 hours to watch this podcast from Lex Fridman with Nathan Lambert (from AI2) about DeepSeek, Tulu, open weights, and Open Source AI. But I recommend at least the first 12 minutes, and the 10 minutes starting at 4:37:49.