OLMo (Open Language Model) was released today as "truly open" (their emphasis), and it includes:
- Full pretraining data: The model is built on AI2's Dolma dataset, a three-trillion-token open corpus for language model pretraining, along with the code that produces the training data.
- Training code and model weights: The OLMo framework includes full model weights for four model variants at the 7B scale, each trained to at least 2T tokens, plus inference code, training metrics, and training logs.
- Evaluation: They've released the evaluation suite used in development, complete with 500+ checkpoints per model (one every 1,000 steps during training) and evaluation code, under the umbrella of the Catwalk project.
Might be worth studying this one as well.