The Missing Third Leg: Training Data Excluded from Open Source AI Definition by [Co-]Design

gvlx · September 29, 2024, 9:08am

While I agree this might be a possible solution, license-wise, it still misses a few considerations (which are also missing in almost all conversation elsewhere, including the EU and US legislation discussions): different types of data, versioning and preservation.

Types of data:

Train Data
Test Data
Weights/parameters - the values that were obtained from training the system
Model Metadata - the ML design which is the core of the system [note: this should be considered a new type of code, not dat]
System Metadata - the precise description of the physical environment used to train the system, including processor (type, and variant) and hardware
User Data - the data supplied by the user on their queries, including the queries themselves

Versioning and Preservation: all data must be kept in a versioned dataset repository so its reuse can be tracked precisely.

I’ll be defining these terms more fully in another post in a few days.