Hi @mjbommar, @samj and all, I’m sorry for the delay, but I’ve only just regained access to my account after being silenced for the proposal to separate concerns between source data and processing information.
The idea was simply to require that, whenever we cannot require the training data to be distributed, they at least be available under the same terms that allowed their use in training in the first place, so that no legal issue arises from the requirement to distribute them.
But probably my English is way worse than I suspected, because the thread was closed without any comment, except for a few questions that I’m not allowed to answer.
Race conditions are bugs, not features.
So is data corruption in RAM.
While they might constitute sustainable technical debt in some situations, they shouldn’t be “normalized”. In fact, a lot of work has been done to achieve determinism since the GTC 2019 talk Determinism in Deep Learning.
I’d argue that, with proper design, it’s always possible (accepting a performance cost) to get exact training reproducibility on the same hardware.
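To give an idea of what I mean by proper design, here is a minimal sketch of a deterministic setup in PyTorch. The assumptions (single GPU, a reasonably recent PyTorch release) are mine, not from this thread, and the exact flags and environment variables can vary between versions:

```python
# Minimal sketch of a deterministic training setup in PyTorch.
# Assumptions (not from this thread): single GPU, recent PyTorch;
# flags and env vars may differ between versions.
import os
import random

import numpy as np
import torch


def make_deterministic(seed: int = 42) -> None:
    # Seed every RNG the training loop might touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Ask cuBLAS for reproducible reductions (needed on CUDA >= 10.2);
    # must be set before the first CUDA call.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    # Fail loudly if a non-deterministic kernel would be selected, and
    # stop cuDNN from autotuning a different algorithm on each run.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


def seeded_loader(dataset, seed: int = 42, **kwargs):
    # DataLoader shuffling needs its own seeded generator,
    # otherwise the batch order changes from run to run.
    g = torch.Generator()
    g.manual_seed(seed)
    return torch.utils.data.DataLoader(
        dataset, shuffle=True, generator=g, **kwargs
    )
```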
But let’s suppose you face some technical limitation that affects all existing hardware and that something along the lines of “sort floats before adding them” cannot fix.
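(For the record, that aside refers to the fact that floating-point addition is not associative, so the order of a reduction changes the last bits of the result; a toy illustration in plain Python, with made-up numbers:)

```python
# Toy illustration: float addition is not associative, so summing the same
# numbers in a different order can change the result. Fixing a canonical
# order (e.g. sorting first) makes the sum stable across orders.
import random

random.seed(0)
xs = [random.uniform(-1e10, 1e10) for _ in range(100_000)]
shuffled = random.sample(xs, len(xs))  # same numbers, different order

print(sum(xs) == sum(shuffled))                  # usually False
print(sum(sorted(xs)) == sum(sorted(shuffled)))  # True: canonical order
```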
The source of randomness can be identified and recorded, so that people can still study exactly how the training process produced the weights from the original data. At worst, you’d have to record the whole training process. Heavy, slow, expensive (without the proper environment) but not impossible.
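Just to sketch what such a record could look like (all the names below are invented for illustration, not an existing tool): an append-only log with, for each optimizer step, the RNG state and the batch order, so that a third party can replay the run or at least spot where a replica diverges.

```python
# Sketch of an audit trail for a training run: record everything that is not
# a pure function of (code, data) -- seeds, RNG states, batch order -- so a
# third party can replay the run or detect where a replica diverges.
# All names here are illustrative, not an existing API.
import hashlib
import json

import torch


def rng_fingerprint() -> str:
    """Hash of the current CPU RNG state (CUDA states could be added too)."""
    return hashlib.sha256(torch.get_rng_state().numpy().tobytes()).hexdigest()


def log_step(log_file, step: int, batch_indices, loss: float) -> None:
    # One JSON line per optimizer step: enough to re-derive the exact
    # sequence of batches and to detect divergence during a replay.
    record = {
        "step": step,
        "rng": rng_fingerprint(),
        "batch": [int(i) for i in batch_indices],
        "loss": float(loss),
    }
    log_file.write(json.dumps(record) + "\n")
```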
Now the point is: why should you?
In several AI systems, you don’t need to do much to achieve reproducibility (the training process is inherently reproducible), but in a few with huge social impact, reproducibility is needed to avoid the open washing of black boxes.
Sure, as @samj pointed out, the open source definition does not mandate reproducible builds and it’s normal to get different binaries when compiling a project with different flags.
But such non-reproducible builds do not inhibit the freedom to study the software! As XZ Utils taught us, you can always inspect the binaries and check their correspondence with the intended code.
Instead, afaik, with some statistical AI systems (LLMs, ANNs, etc.) the only way to ensure that the training data declared by the developers are actually the ones used to compute the distributed weights is to replicate the process.
So the only way to prevent the Open Source AI definition from becoming an Open Washing AI definition, used to fool users and gain their trust while preventing them from really studying the actual system they are using, is to require such reproducibility (which, as said, might at worst amount to recording the whole training process).
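Mechanically, the check I have in mind is nothing exotic (the helper below is hypothetical, just to make the idea concrete): retrain from the declared data with the recorded seeds/log, then compare the resulting weights with the published checkpoint.

```python
# Sketch of the verification step: given the declared data and the recorded
# seeds/log, retrain, then compare the replica with the published checkpoint.
# The training replay itself is out of scope here; only the comparison is shown.
import torch


def weights_match(replica_path: str, published_path: str, atol: float = 0.0) -> bool:
    replica = torch.load(replica_path, map_location="cpu")
    published = torch.load(published_path, map_location="cpu")
    if replica.keys() != published.keys():
        return False
    # atol=0.0 demands bit-exact reproduction; a small tolerance could be
    # accepted if the recorded process only bounds, rather than removes,
    # the residual non-determinism.
    return all(
        torch.allclose(replica[k], published[k], atol=atol, rtol=0.0)
        for k in replica
    )
```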
I’m more than happy to learn about a different way to obtain the same guarantee about the completeness of the declared training data.
But an OSAID that can be used to open wash black boxes, negating the freedom to study the system and selling the freedom to fine-tune as the freedom to modify the system, would be inherently insecure and detrimental to users and researchers.
So I hope we can keep brainstorming, looking for a better definition that can really grant the freedoms it aims to grant, including the freedom to study and to modify the system.