[RFC] Separating concerns between Source Data and Processing Information

nick · September 16, 2024, 7:26pm

I’m sorry @Shamar, but this doesn’t make any sense.

It does not impose the distribution of the source data, as long as they are available to the public under the same terms that allowed their use in the training process for the original developers;

The rights to use the data for training are different from the rights to distribute such data. Others on this forum have explained this repeatedly.

Of course, system that cannot legally share the source data would not qualify as Open Source AI, and that’s fine: we will have valuable “weights-available” AI systems just like we have valuable “source-available” software.

I don’t follow. Is the distribution of the source data imposed/required or not?

Source Data : All the data used to train the system must be available for use by the public either under licenses that comply with the Open Source Definition or under the same legal terms that allowed their use to train the system in the first place, so that a skilled person can recreate an exact copy of the system using the same data.

As others have repeatedly confirmed on this forum, it’s impossible to guarantee the recreation of an exact copy of an AI system like an LLM, even when using the same data.
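One commonly cited reason, offered here only as an illustration and not as something established in this thread, is that floating-point addition is not associative. Training sums huge numbers of values, and parallel hardware does not always add them in the same order. The minimal sketch below shows the effect on three numbers; the variable names are hypothetical and it does not model any real training run.

```python
# Floating-point addition is not associative: grouping the same three
# numbers differently gives results that differ in the last bits.
left_grouping = (0.1 + 0.2) + 0.3
right_grouping = 0.1 + (0.2 + 0.3)

# True when the two groupings disagree bit-for-bit.
grouping_changes_result = left_grouping != right_grouping

# The disagreement is tiny, but it is not zero.
difference = abs(left_grouping - right_grouping)

print(left_grouping)
print(right_grouping)
print(grouping_changes_result)
```

A difference this small is harmless in isolation, but over many update steps such discrepancies can accumulate, so two runs on identical data may not produce bit-identical weights.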

Please present new arguments instead of opening new topics that repeat the same ones over and over, and leave space for others to comment. Don’t flood the conversation, and please respect the guidelines.