The recently released DeepSeek-R1 (https://api-docs.deepseek.com/news/news250120) is being trumpeted all over as being “more” open source than others, and it’s true that it uses the MIT license for code and weights, but I’ve not seen anything clear about the training data yet.
Perhaps we can share relevant info here about this model?
I haven’t investigated DeepSeek at all yet, but based on the rumors I’ve heard, I suspect there might be an issue with the data rather than the code. I believe they are skillfully making use of synthetic data, but I’m concerned about its source. If there are any contractual restrictions on how the synthetic data can be used, it could introduce legal uncertainty for the resulting model. If you happen to have detailed information about DeepSeek’s training data, I would very much appreciate it if you could share.
You are right about the lack of data information for DeepSeek; disclosure of that information is a requirement of the OSAID.
Also, the OSAID explicitly states that the:
“code shall represent the full specification of how the data was processed and filtered, and how the training was done. Code shall be made available under OSI-approved licenses.”
From what I could find, although the DeepSeek code is licensed under MIT, the particular code used to process and filter the data and to run training is missing. This is where the “secret sauce” is.
Their paper provides some clues as to their approach:
Nathan Lambert from AI2 provides a good analysis of the process:
But again, both 1) the data information and 2) the code to process and filter the data and to run training are missing, thus failing to meet the OSAID.
Our science team has started working on fully reproducing and open-sourcing R1, including training data, training scripts,…
The full power of open source AI, so that everyone all over the world can take advantage of AI progress! It will help debunk some myths too, I’m sure.
I think it’s worth watching this repo and participating:
Parts of DeepSeek appear to be OSS while other parts are not. For example, in Deepseek-LLM, there’s this Model License which is not OSD-compliant due to a long list of domain restrictions (and possibly other issues, I haven’t reviewed it in depth).
Deepseek seems to be drawing a distinction with licenses between R-1, v2, and v3. V2 and V3 clearly have the license mentioned (Deepseek License Agreement), distinguishing that the code is under the MIT License and the model is under the Deepseek License Agreement. R-1 seems to only reference MIT from what I see; this doesn’t mean it meets the Open Source AI Definition - just pointing out the difference. Also note that there is good info here: DeepSeek-R1 Release | DeepSeek API Docs. On the left, you can pull up specific version info. For example, for R-1 you’ll see this statement: “DeepSeek-R1 is now MIT licensed for clear open access.” For v3, you’ll see this: “Open-source spirit + Longtermism to inclusive AGI.”
It’s really hard for me to tell because I don’t have a clear picture of what the dependencies are between the various Deepseek repos. For example, is R1 completely independent of Deepseek-LLM, or does it incorporate it?
DeepSeek has built a test of meaningful open-sourcishness into R1. The model, as released, censors specific content according to Chinese political policy. Can folks actually alter (and redistribute) a version without that censorship?
Japan’s CyberAgent, Inc. has already published a model in which they performed reinforcement learning on DeepSeek R1’s Qwen-distilled model using a Japanese dataset. This model apparently has the capacity to describe, in detail, events that would typically be censored in China (such as the Tiananmen Square incident), and it appears capable of switching from Chinese to Japanese reasoning. In that sense, it seems feasible to refine the model in ways people desire.
That said, if DeepSeek R1 itself uses output data from a model subject to restrictions under a different contract, releasing DeepSeek R1 might no longer be permissible. It remains unclear whether any of the major U.S. AI companies will investigate the matter to uncover such details.
More information about Hugging Face’s effort to open source DeepSeek has been published here:
Technically, R1 is “open” in that the model is permissively licensed, which means it can be deployed largely without restrictions. However, R1 isn’t “open source” by the widely accepted definition because some of the tools used to build it are shrouded in mystery. Like many high-flying AI companies, DeepSeek is loath to reveal its secret sauce.
That’s for a different model: DeepSeek V3. See the discussion above about how R1 is different, and has an MIT license on the R1 model.
As noted in the link @moodler posted above, DeepSeek’s github for R1 says:
This code repository and the model weights are licensed under the MIT License.
I wish I had 5 hours to watch this podcast from Lex Fridman with Nathan Lambert (from AI2) about DeepSeek, Tulu, open weights, and Open Source AI. But I recommend watching the first 12 minutes and the 10 minutes starting at 4:37:49.