I don’t trust AllenAI because of their deceptive marketing around the term “Open Source”. Yes, a few of their models are genuinely open, but a large number of them are based on the proprietary Llama. That is quite a contradiction to their marketing of being Open Source.
Regarding that model, take a closer look at the picture, because it is deceptive!
They claim: “State-of-the-art results: OLMo 2 32B matches or outperforms GPT-3.5 Turbo, GPT-4o mini, Qwen 2.5 32B, Mistral 24B, and approaches Qwen 2.5 72B, Llama 3.1 and 3.3 70B.”
OLMo-2-0325-32B vs. Mistral-Small-24B:
Average Score: OLMo-2-0325-32B (72.9) is slightly lower than Mistral-Small-24B (75.2).
Individual Metrics:
ARC/C: OLMo-2-0325-32B (90.4) is lower than Mistral-Small-24B (93.3).
HSwag: OLMo-2-0325-32B (89.7) is lower than Mistral-Small-24B (91.3).
WinoG: OLMo-2-0325-32B (78.7) is higher than Mistral-Small-24B (77.8).
MMLU: OLMo-2-0325-32B (74.9) is lower than Mistral-Small-24B (80.7).
DROP: OLMo-2-0325-32B (74.3) is lower than Mistral-Small-24B (74.4).
NQ: OLMo-2-0325-32B (50.2) is higher than Mistral-Small-24B (42.3).
AGIEval: OLMo-2-0325-32B (61.0) is lower than Mistral-Small-24B (69.1).
GSM8k: OLMo-2-0325-32B (78.8) is lower than Mistral-Small-24B (79.7).
MMLUPro: OLMo-2-0325-32B (43.3) is lower than Mistral-Small-24B (54.2).
TriviaQA: OLMo-2-0325-32B (88.0) is lower than Mistral-Small-24B (88.8).
Overall, Mistral-Small-24B scores higher than OLMo-2-0325-32B on 8 of the 10 metrics, as the quick tally below confirms.
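You can sanity-check this yourself by tallying the numbers quoted above. Here is a minimal Python sketch (the variable names are mine, not from AllenAI’s eval code) that counts per-metric wins and recomputes the averages:

```python
# Scores as quoted above for OLMo-2-0325-32B vs. Mistral-Small-24B.
metrics = ["ARC/C", "HSwag", "WinoG", "MMLU", "DROP",
           "NQ", "AGIEval", "GSM8k", "MMLUPro", "TriviaQA"]
olmo    = [90.4, 89.7, 78.7, 74.9, 74.3, 50.2, 61.0, 78.8, 43.3, 88.0]
mistral = [93.3, 91.3, 77.8, 80.7, 74.4, 42.3, 69.1, 79.7, 54.2, 88.8]

# Count the metrics where OLMo's score is strictly higher.
olmo_wins = [m for m, a, b in zip(metrics, olmo, mistral) if a > b]
print(f"OLMo wins {len(olmo_wins)}/{len(metrics)}: {olmo_wins}")
# -> OLMo wins 2/10: ['WinoG', 'NQ']

# Recompute the averages from the per-metric scores.
print(f"Averages: OLMo {sum(olmo)/len(olmo):.1f}, "
      f"Mistral {sum(mistral)/len(mistral):.1f}")
# -> Averages: OLMo 72.9, Mistral 75.2
```

The recomputed averages match the 72.9 and 75.2 shown in their own chart, so their own numbers already contradict the “matches or outperforms” claim for Mistral.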
OLMo-2-0325-32B vs. Qwen-2.5-32B:
Average Score: OLMo-2-0325-32B (72.9) is lower than Qwen-2.5-32B (74.9).
Individual Metrics:
ARC/C: OLMo-2-0325-32B (90.4) is lower than Qwen-2.5-32B (95.6).
HSwag: OLMo-2-0325-32B (89.7) is lower than Qwen-2.5-32B (96).
WinoG: OLMo-2-0325-32B (78.7) is lower than Qwen-2.5-32B (84).
MMLU: OLMo-2-0325-32B (74.9) is lower than Qwen-2.5-32B (83.1).
DROP: OLMo-2-0325-32B (74.3) is higher than Qwen-2.5-32B (53.1).
NQ: OLMo-2-0325-32B (50.2) is higher than Qwen-2.5-32B (37).
AGIEval: OLMo-2-0325-32B (61.0) is lower than Qwen-2.5-32B (78).
GSM8k: OLMo-2-0325-32B (78.8) is lower than Qwen-2.5-32B (83.3).
MMLUPro: OLMo-2-0325-32B (43.3) is lower than Qwen-2.5-32B (59).
TriviaQA: OLMo-2-0325-32B (88.0) is higher than Qwen-2.5-32B (79.9).
Overall, Qwen-2.5-32B outperforms OLMo-2-0325-32B on 7 of the 10 metrics.
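The same tally works for Qwen. Again, this is just a sketch over the quoted scores (lists redefined so it runs standalone):

```python
# Scores as quoted above for OLMo-2-0325-32B vs. Qwen-2.5-32B.
metrics = ["ARC/C", "HSwag", "WinoG", "MMLU", "DROP",
           "NQ", "AGIEval", "GSM8k", "MMLUPro", "TriviaQA"]
olmo = [90.4, 89.7, 78.7, 74.9, 74.3, 50.2, 61.0, 78.8, 43.3, 88.0]
qwen = [95.6, 96.0, 84.0, 83.1, 53.1, 37.0, 78.0, 83.3, 59.0, 79.9]

olmo_wins = [m for m, a, b in zip(metrics, olmo, qwen) if a > b]
print(f"OLMo wins {len(olmo_wins)}/{len(metrics)}: {olmo_wins}")
# -> OLMo wins 3/10: ['DROP', 'NQ', 'TriviaQA']

print(f"Average: Qwen {sum(qwen)/len(qwen):.1f}")
# -> Average: Qwen 74.9
```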
While OLMo-2-0325-32B is competitive on some metrics, it does not consistently match or outperform Mistral-Small-24B, Qwen-2.5-32B, or Llama-3.1-70B. The claim that it matches or outperforms these models is not fully supported by the data in their own chart: OLMo-2-0325-32B performs well, but it is generally outperformed by the other models mentioned.
They also made the numbers bold in that graphic. Bolding is a common convention in LLM benchmark charts to point out which model is best on each metric. However, they bolded their own model’s scores as if it were the better one, treating me as a visitor like I am stupid, like I am not going to see what the problem is.
I get the feeling they are doing this more for marketing purposes than for actual improvements.