The Missing Third Leg: Training Data Excluded from Open Source AI Definition by [Co-]Design

Hi @shujisado,

While I agree this might be a possible solution license-wise, it still misses a few considerations (which are also missing from almost all conversations elsewhere, including the EU and US legislation discussions): the different types of data, versioning, and preservation.

Types of data:

  • Training Data
  • Test Data
  • Weights/parameters - the values that were obtained from training the system
  • Model Metadata - the ML design which is the core of the system [note: this should be considered a new type of code, not data]
  • System Metadata - the precise description of the physical environment used to train the system, including the processor (type and variant) and other hardware
  • User Data - the data supplied by the user in their queries, including the queries themselves

Versioning and Preservation: all data must be kept in a versioned dataset repository so its reuse can be tracked precisely.
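
To make the versioning point a bit more concrete, here is a minimal Python sketch of one way reuse could be tracked: hash every data artifact, then hash the resulting manifest to get a single version ID for the whole dataset. This is only an illustration under my own assumptions, not a proposal for a specific tool. The function names (`fingerprint`, `build_manifest`, `manifest_version`) and the file names and contents are hypothetical.

```python
import hashlib
import json


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 hex digest of one data artifact's bytes."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(artifacts: dict[str, bytes]) -> dict[str, str]:
    """Map each artifact name to the hash of its content."""
    return {name: fingerprint(data) for name, data in sorted(artifacts.items())}


def manifest_version(manifest: dict[str, str]) -> str:
    """Derive a single version ID for the dataset from its manifest."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Hypothetical artifacts, for illustration only.
v1 = build_manifest({"train.csv": b"a,b\n1,2\n", "test.csv": b"a,b\n3,4\n"})
v2 = build_manifest({"train.csv": b"a,b\n1,2\n5,6\n", "test.csv": b"a,b\n3,4\n"})

# Any change to an artifact changes its hash, so differences are detectable.
changed = [name for name in v1 if v1[name] != v2.get(name)]
print(changed)
print(manifest_version(v1) != manifest_version(v2))
```

A model release could then cite the manifest version it was trained and tested on, so anyone holding the same files can check that they match.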

I’ll be defining these terms more fully in another post in a few days.