Training data access

zack · January 30, 2024, 8:18am

Just echoing my position from previous discussions here.

I consider that full access to the original training data (as well as the training code, although this is off-topic for this thread) is a requirement to exercise the freedom to study an “AI system” based on machine learning. Short of that access, it is not possible to fully implement relevant use cases for transparency, such as analyzing biases efficiently.

Note that to fulfill this requirement alone, one does not need the ability to redistribute the training dataset, just to access it. However, some common current practices (e.g., signing TOS-style agreements before accessing the data) remain problematic, and we should probably consider such data gatekeeping unacceptable.

I do think that the ability to redistribute the original training dataset should be a requirement for the freedom to modify an ML-based AI system, because short of that, the range of modifications one can make to such a system is much more limited.

Cheers