Open Weights or Open Source AI?

Shamar · September 26, 2024, 3:49pm

The problem with OSAID 0.0.9 is not that it doesn’t match a truly open AI system, but that it can be satisfied by many black boxes too.

Luckily we could let them declare their work as Open Source AI, without compromising on its quality!

From what you said, NII’s LLM matches even the alternative definition that I proposed a few days ago:

The preferred form of making modifications to a machine-learning system is:

Source Data: All the data used to train the system must be available for use by the public either under licenses that comply with the Open Source Definition or under the same legal terms that allowed their use to train the system in the first place, so that a skilled person can recreate an exact copy of the system using the same data.

For example, if used, this would include the training data sets and the checksums that guarantee their integrity, the values obtained from random sources during the training process and any other data required to compute the weights distributed with the system.
If any part of the source data cannot be distributed by the developers, it can be referenced by URL, timestamp and hash, as long as it stays available to the public under the same terms that allowed its use in training.
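
To make the "URL, timestamp and hash" idea concrete, here is a minimal sketch in Python of what such a reference and its integrity check might look like. The `SourceDataReference` record, its field names and the example URL are my own assumptions for illustration, and SHA-256 is just one possible choice of hash; the definition does not prescribe any of them.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDataReference:
    """Hypothetical manifest entry for data that is referenced, not redistributed."""
    url: str           # where the public can retrieve the data
    retrieved_at: str  # ISO 8601 timestamp of the retrieval used for training
    sha256: str        # hex digest of the exact bytes used for training


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def matches_reference(data: bytes, ref: SourceDataReference) -> bool:
    """Check that retrieved bytes are the ones the developers trained on."""
    return sha256_of(data) == ref.sha256


# Illustrative values only: a stand-in corpus and a made-up URL.
corpus = b"example training corpus\n"
ref = SourceDataReference(
    url="https://example.org/datasets/corpus.txt",
    retrieved_at="2024-09-26T15:49:00Z",
    sha256=sha256_of(corpus),
)

print(matches_reference(corpus, ref))                # unchanged data
print(matches_reference(corpus + b"tampered", ref))  # modified data
```

Anyone who downloads the referenced data can run the same check, so a change at the source is detected even when the developers cannot host the data themselves.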

Processing Information: Sufficiently detailed information about the process used to train the system from the source data, so that a skilled person can recreate an exact copy of the system using the same data. Processing information shall be made available with licenses that comply with the Open Source Definition.

For example, if used, this would include the training methodologies and techniques, how to retrieve the source data and check their integrity, the labeling procedures and data cleaning methodologies, and any other information required to compute the weights distributed with the system from the source data.
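
One piece of "information required to compute the weights" is the set of values drawn from random sources, mentioned under Source Data. The toy sketch below (plain Python, not any real training framework) only illustrates the principle: if every random draw comes from a recorded seed, rerunning the same process on the same data gives the same weights, which can then be compared by hash. The function names and the tiny linear model are assumptions of mine; real training on accelerators can need more than a seed to be bit-for-bit reproducible.

```python
import hashlib
import random
import struct


def train_toy_model(samples: list[tuple[float, float]], seed: int,
                    steps: int = 200, lr: float = 0.01) -> tuple[float, float]:
    """Fit y = w*x + b with stochastic gradient descent.

    All randomness (initial weights, sample order) comes from one seeded
    generator, so the seed is part of the processing information.
    """
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)
    b = rng.uniform(-1.0, 1.0)
    for _ in range(steps):
        x, y = rng.choice(samples)
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return w, b


def fingerprint(weights: tuple[float, ...]) -> str:
    """Hash the exact binary representation of the weights."""
    packed = struct.pack(f"<{len(weights)}d", *weights)
    return hashlib.sha256(packed).hexdigest()


# Stand-in "source data": points on the line y = 2x + 1.
samples = [(float(x), 2.0 * x + 1.0) for x in range(10)]

published = fingerprint(train_toy_model(samples, seed=42))
recreated = fingerprint(train_toy_model(samples, seed=42))
print(published == recreated)  # same data, same process, same seed
```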

Source Code: The source code used to train and run the system, made available with licenses that comply with the Open Source Definition.

For example, if used, this would include code used for pre-processing data, code used for training, validation and testing, supporting libraries like tokenizers and hyperparameter search code, inference code, and model architecture.

Weights: The model weights and parameters, made available with licenses that comply with the Open Source Definition.

For example, this might include checkpoints from key intermediate stages of training as well as the final optimizer state.

So NII’s LLM proves that such a definition (maybe with better wording? I’m a non-native English speaker too!) would be viable, as it would be ready by 24 October, and that it would not define an empty set.