Open Source AI needs to require data to be viable

We’ve been following this discussion with great interest. Stefano mentions “a few relevant efforts that support a degree of openness”. Ours is among them (see opening-up-chatgpt.github.io), and we’ve made the argument in multiple peer-reviewed papers as well as in a contribution last October to one of the OSI deep dives.

I want to share a new FAccT’24 paper of ours that goes into these matters in some detail. I don’t want to hijack this thread, but since Stefano pointed me to it and invited me to share the paper here, I’ll take the liberty of providing a quick pointwise summary:

  1. We diagnose open-washing as a key threat posed by Big AI that hurts the open source ecosystem.
  2. We point out a possible loophole in the EU AI Act that grants “open source” models exemptions (and thereby gives Big AI a strong incentive to open-wash).
  3. We point out the risk of relying exclusively on a (binary) license for open source decisions.
  4. We argue for a gradient and composite notion of openness to disincentivize open-washing and other forms of audit-washing.
  5. We draw attention to BloomZ’s RAIL license and propose that it should count towards openness too, or at least should not necessarily detract from it (in line with what @Danish_Contractor says above).
  6. We argue that datasets are the area lagging furthest behind in openness. A definition that is vague on this point risks being toothless and vulnerable to open-washing. (I agree with @juliaferraioli’s OP on this.)

We also question the wisdom of focusing on a binary definition. As we point out:

> A licence and its definition form a single pressure point that will be targeted by corporate lobbies and big companies. The way to subvert this risk is to use a composite and gradient approach to openness.

Hopefully our evidence-based approach, our survey of >45 ‘open’ LLMs, and our documentation of the ways in which companies are already open-washing provide some useful empirical background for the discussions here.

Liesenfeld, Andreas, and Mark Dingemanse. 2024. ‘Rethinking Open Source Generative AI: Open-Washing and the EU AI Act’. In The 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24). Rio de Janeiro, Brazil: ACM. (public PDF in our institutional repository)
