Case in Point: Zuckerberg's blog on Open Source

I just came to think about Mark Zuckerberg’s blog post around the release of Llama 3.1, in which he declares that “Open Source AI Is the Path Forward”. It is easy to agree with what he says there; the trouble arises around what he doesn’t say. I also feel that this says a lot about the role of Open Source in society, and that we should be able to derive much insight from it. Hence this thread.

The most obvious problem is that Llama 3.1 isn’t Open Source and will not be under any likely Open Source AI Definition. How do we respond to that? Just by pushing Meta away, telling them to stop and go somewhere else? What would result from that? Or is there a path by which a future Llama could actually be OSAID-compliant? Is there a “thank you, but this is where we need to be going” path forward? Would key decision makers elsewhere realize that Llama isn’t actually Open Source? What does it mean for the status of EU legislation? Could it bring Open Source generally into disrepute?

Further, it highlights that Open Source seems, to some, largely ungoverned, or at least governed by the same billionaire warlords who have brought society into such disarray. We know that this is generally not the case; there exist many well-governed ecosystems around Open Source projects, but AI brings many new problems.

Like whether the model should have been published at all, and whether the security of the thing is actually as good as Meta claims, given that they say they have been very careful about it.

It also calls into question whether it is realistic to carry out the third-party assessment that they claim.

Say the license were fixed: would the model then be Open Source as per the OSAID? And what would that say about the OSAID?

1 Like

OSI has already declared that the Llama 2 License is not an Open Source license. The Llama 3.1 License differs from the Llama 2 License only in a few minor ways, so I don’t see much point in reiterating the same argument.

Mark Zuckerberg might be asserting that OSI’s definition of Open Source is for source code licenses and does not apply to LLMs. In some jurisdictions it is already settled that the results of AI training do not fall under copyright, and it is true that the legal validity of Open Source licenses in such cases remains unclear. Therefore, it’s possible that Zuckerberg has a different definition of Open Source than we do.

There’s not much we can do in this situation, but I believe that the creation of OSAID is our response to him.

1 Like

Yes, as I said (and I linked to the Llama 3.1 license), it is clearly not Open Source. The contrast lies in the fact that Zuckerberg says many of the same things about Open Source that “we” would, for some value of “we”. That’s what makes it interesting to engage: I don’t think it would be wise to just push them away; we should rather pull Meta in.

So, it is not just about the license; it is about all the other things too.

There was a time when people from AWS participated in discussions here, and based on their arguments, I suspect that OSI may have invited major companies to join the discussions as early as last year. Although AWS is not a sponsor of OSI, Meta is one of OSI’s sponsors, so they are likely well aware of OSI’s activities. Additionally, OSI members have been promoting the OSAID development process at various events like Open Source Summit and AI_dev around the world. There would have been opportunities to engage with companies like Meta at those events as well.
My hope is that representatives from major companies like Google and Meta, as well as those from AI startups, will participate more actively in the discussions, though I understand that they may have reasons that make it difficult for them to join.

That said, I have been participating in the development of OSAID solely as an individual and have only engaged in discussions in public forums, so all of the above is merely my speculation.

This is a pertinent point, and with the OSAID in its current state (0.0.9) we don’t really have a leg to stand on: even if we were to require metadata (“Data Information” in the current draft, whatever that means), without access to the data itself we would have no way to verify the claims!

Not that the claims to verify are even being made any more: Llama copyright drama: Meta stops disclosing what data it uses to train the company’s giant AI models

A major battle is brewing over generative AI and copyright. Publishers want to be paid if their work has been used to train large language models. Big tech companies would rather not pay.

One way to avoid the issue is to just not tell anyone what data you used to train your AI model. Meta seems to be trying that tactic.

On Tuesday, the social-media giant released a massive new model called Llama 2. The research paper shares very little about what data was used.

“A new mix of publicly available online data,” Meta researchers wrote in the paper. That’s basically it.

This is unusual. Until now, the AI industry has been open about the training data used for models. There’s a reason: This powerful technology must be understood, and its outputs must be as explainable and traceable as possible, so that if something goes wrong researchers can go back and fix things. Training data is key to how these models perform.

We know from its announcement that “Llama 3 is pretrained on over 15T tokens that were all collected from publicly available sources [and that its] training dataset is seven times larger than that used for Llama 2, and it includes four times more code”.

We’ve also known for at least the past year that Meta Trained its New AI Using Public Instagram and Facebook Posts, and that we Aussies at least (but likely anyone outside of Europe) “don’t currently have the option to opt-out of their public posts and captions being used to train Meta’s AI systems”. We also know that Facebook owner Meta seeks to train AI model on European data as it faces privacy concerns.

Does anyone want to take the bet that Llama models don’t already include billions of tokens of Facebook and Instagram user data? I certainly don’t, but I’m also not all that bothered by it (except for the lack of opt-out). I don’t doubt Llama 3’s claim that it was “The most capable openly available LLM to date” (until they released Llama 3.1), and I agree that “Open Source AI Is the Path Forward”—actually I’ve already argued that Open Source Personal AI is the Path Forward. Only this isn’t it. Or is it?

Without a clear and strong OSAID, one that demands access to training data, there’s nothing we can do to stop such claims, or to verify them when they are made. They wouldn’t even have bothered claiming software was Open Source if it wasn’t under an OSI-approved license; but in the absence of a definition or, worse, with a weak one, we can only watch as they say whatever they want.

1 Like

Yes, I thought that Zuckerberg blog post was just a beautiful example of the Llama/Meta playbook to capture the term open source. What I noticed is that last year the push appeared to start in the EU, with French Meta AI folks like LeCun doubling down on the term.

I don’t think it’s coincidental that they were just then starting to intensify their EU lobbying and scheming (very likely with Mistral) to expand their presence in the ‘open source’ field in Europe.

I also don’t think it’s a coincidence that a few months later, the EU AI Act draft was released to much fanfare and it included an exemption for “open source” models.

There’s not a lot of gain to be had by Meta from just co-opting a term like this. It is not widely known enough to be of much use to them in their marketing to the general public, and folks who know FOSS are not likely to fall for it.

But there is a huge potential gain if they can somehow achieve “open source” status in the sense of that (critically undefined) term in the EU AI Act, which would enable them to escape the onerous requirements that hamper other general-purpose model providers. This is the massive loophole we and others have pointed out.

I am happy that OSI has at least put its foot down and clearly stated that Llama currently is not open source. But it is not clear to me that the current OSI definition will in any way prevent Meta from releasing a version that just scrapes by in terms of openness. Data is their main liability, and that is precisely the element that in the current RC has been watered down to “sufficient information”, i.e., some metadata of their own choosing.

There is nothing new here. Anybody who knows a bit about the history of “fair trade” will know how multinationals like Starbucks and Nestlé were happy to jump on the bandwagon, and then used their influence to water down the concept. As a sociologist noted: “co-optation … occurs primarily on the terrain of standards in the form of weakening or dilution” (Jaffee 2012).

Daniel Jaffee. 2012. Weak Coffee: Certification and Co-Optation in the Fair Trade Movement. Social Problems 59, 1 (Feb. 2012), 94–116. https://doi.org/10.1525/sp.2012.59.1.94

2 Likes

I wish that one day Meta will release the training code and the data preprocessing code, change the license of the weights, release the URLs of all the open and public data they used, disclose the provenance of all the obtainable data, and document in detail all the private information they used (see FAQ). Then it would be Open Source AI.

Until then, and even with the 0.0.9 wording, Llama is not even close to being Open Source AI. I don’t understand why everybody keeps pointing at this risk: it’s sooo clear it cannot be Open Source AI, not even by mistake, not even if you just glance at it. Nope. It’s not Open Source AI… It’s on Meta to release all the details required.

2 Likes

Simply because Meta just needs to tweak the licenses and distribute a bunch of synthetic data (or even a subset of the actual training dataset) to pass as Open Source AI under the current definition.
Something Zuckerberg might announce overnight.
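
To make the loophole concrete, here is a minimal sketch of what such “data laundering” could look like. Everything here is hypothetical: ClosedModel, launder_training_data and the prompts are placeholders I made up, not any real API.

```python
# Purely hypothetical sketch of the loophole: none of these names refer to a
# real API. A provider ships synthetic outputs instead of the training set.

class ClosedModel:
    """Stand-in for a proprietary model whose training data we cannot inspect."""
    def generate(self, prompt: str) -> str:
        return f"completion for: {prompt}"  # the real model's output would go here

def launder_training_data(model: ClosedModel, seed_prompts: list[str]) -> list[str]:
    # The provider picks the seeds, so they control what the "data" reveals.
    return [model.generate(p) for p in seed_prompts]

# The released "dataset" lets a skilled user train *a* model, but tells you
# nothing about what the released system was actually trained on.
synthetic = launder_training_data(ClosedModel(), ["prompt 1", "prompt 2"])
```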

That’s why we talk about open washing: if what you are free to study is not the released system but a different one, it’s not open source.

Just like Chrome is not open source, despite Chromium existing.

1 Like

If it is truly possible to recreate a model with performance equivalent to Meta’s model using that large amount of synthetic data, wouldn’t the recreated model, which was supposed to be a copy, become the genuine model?

No, just like distributing obfuscated source code or the output of a preprocessor wouldn’t be enough to satisfy the Open Source definition.

Deliberately obfuscated source code is not allowed. […]
the output of a preprocessor or translator are not allowed.
The Open Source Definition – Open Source Initiative
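
To illustrate the analogy with a toy example (mine, not from the OSD): both snippets below compute the same thing, but only the first is the preferred form in which to study and modify the program; the OSD rejects the second even though it “works”.

```python
# Readable source: the preferred form for study and modification.
def average(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

# Deliberately obfuscated equivalent: functionally identical, but not an
# acceptable "source" form under the OSD.
_ = lambda v: (lambda s, n: s / n)(sum(v), len(v))

assert average([1, 2, 3]) == _([1, 2, 3])
```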

Under the current definition, one must provide the complete data information from the original data. That goes way beyond just providing synthetic data or “obfuscated data”, and it is a much better (preferred) form in which to study the released system.

The FAQ has just been updated and should address the doubts raised here.

At present, Meta still uses the term “Open Source”. I think we should inform Meta about the details of the terms and license definitions related to what Zuckerberg said, because his wording is likely to cause widespread misunderstanding, and it should be made clear what will happen if a problem occurs in the future. How does the meaning of “Open Source” as he used it to support Llama differ from the meaning of the guidelines and license terms that we use as criteria for Llama? That difference needs to be understood by the public if he is going to keep using this term in broad speech.

if it is truly possible to recreate a model with performance equivalent to Meta’s model using that large amount of synthetic data

Ah, but that’s the thing. The current OSAID proposal only requires that a ‘skilled user’ (skilled how?) be able to create a ‘substantially equivalent’ system.

Equivalent how? (Performance on automatic, proven-to-be-gameable metrics, likely.) And how much wiggle room does that hedge ‘substantially’ provide?
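
To see how much room that hedge leaves, consider a toy example (hypothetical data and names, my own): if “substantially equivalent” is judged on a fixed public benchmark, a system that merely memorized the benchmark passes with a perfect score.

```python
# Toy illustration: a "model" that memorized a public benchmark scores 100%
# on it, i.e. looks "equivalent" by that metric, yet generalizes to nothing.
benchmark = {"2+2?": "4", "Capital of France?": "Paris"}

def accuracy(model, suite):
    # Fraction of benchmark questions the model answers exactly right.
    return sum(model(q) == a for q, a in suite.items()) / len(suite)

memorizer = lambda q: benchmark.get(q, "")  # just looks up the answer key

print(accuracy(memorizer, benchmark))  # 1.0
```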

It just seems designed to open up all sorts of degrees of freedom that will mostly benefit the big players seeking to evade scrutiny, while taking the oxygen out of the room for the smaller players who have already demonstrated that meaningful openness is in fact possible.

1 Like

Under the current definition, one must provide the complete data information from the original data

What version of the definition? v0.0.9 only asks for ‘sufficiently detailed’ data information. Is the notion of ‘sufficient detail’ defined or specified anywhere that I missed? It seems quite different from ‘complete’ information.


1 Like

From my understanding, OSAID 0.0.9 demands the completeness of data solely through the use of the word “recreate”. This is based on the assumption that it holds the same meaning as “reproduction” under U.S. copyright law; at least when compared with Japanese copyright law, there is no inconsistency. On the flip side, however, the only basis for requiring completeness is that single term, which admittedly makes it a weak foundation.

1 Like

The reason is that we don’t know where the real power to define what Open Source means lies.

Does the OSI really have that power? Will “people” (for some value of “people”) refer to the OSAID when they talk about Open Source AI, or will they actually go with what Meta and undoubtedly others talk about? Will powerful industry players actually refer to OSAID? Will governments? And will the EU, as mentioned in the AI Act? I am not at all confident the OSI definition will stick, even though we agree that Llama 3.1 is not Open Source by the current draft or any future definition.

I think there are two crucial, if opposing, points to this: One is that Meta and others will be open-washing based on a “stickier” industry definition.

The other is all the anti-open-source rhetoric by some influential AI voices, notably most of the longtermists, Center for Humane Technology, Yuval Noah Harari, OpenAI, Centre for International Governance Innovation, and many others. They point (and will most likely continue to point) to anything they label Open Source and say it is unsafe, insecure, and should be banned or regulated more strictly than proprietary systems. They may not at all point to systems that we think conform with the OSAID, but again, who has the power to make the definition stick?

If/when Llama or some other system causes harm, how can OSI or the community ensure that the blame is not put on actually conforming Open Source systems?

1 Like

The current push by forces like Meta to promote their own version of the term “open source AI” reminds me of Microsoft’s Shared Source Initiative around 2004. Some parts of Shared Source were open source, but it was mainly a concept that emphasized the ability to view the source code, without the freedom to modify it. Back then, our community didn’t support this Microsoft initiative, but I don’t think OSI played a big role in resisting it; OSI simply continued to protect the term “Open Source.” Yes, I think that was the most important thing.

However, unlike Shared Source, Meta is using the same term “Open Source” as we are, and I feel that within our camp, we don’t yet have a clear flag bearer like Linux. If Llama is the Windows of AI, then perhaps we do need a “Linux” for the AI world. However, that is not OSI’s job.

1 Like

Well, OSI ran a “co-design” process to legitimize its definition.

Unfortunately, an “error” during the process gave the Llama team (and only the Llama team) the power to exclude training and testing data from the definition by cancelling out other teams’ votes.

Thanks to @samj, we now know that, once that error is fixed, the “co-design” process leads to a requirement to share such data.

Yet the FAQs still allow “unshareable private data”.

This is baffling at least.

I wouldn’t give much weights to their mumbling (nerd pun intended :innocent:)

There is only so much credibility that money can buy.
And it’s pretty easy to lose: Don’t believe the hype: AGI is far from inevitable | Radboud University

Also, the only kind of open washing that we can really address here is open washing through OSI’s definition.

It can only stick based on its own quality.

While he seems to draw the wrong conclusions, the fact that @stefano (OSI President) quoted RMS and his “foundational work” several times as an authoritative source is an example of this dynamic.

We relied heavily on foundational texts like the GNU Manifesto and the Four Freedoms of software.
[…]
…thanks to people like Richard Stallman and the GNU GPL…

I mean, we all remember that just 3 years ago several OSI directors described RMS as a “dangerous force in the free software community for a long time”, so dangerous that they felt the need to “call for the entire Board of the FSF to step down and for RMS to be removed from all leadership positions”.

And you can’t imagine how much I appreciate the forthcoming public apology on OSI’s blog that @stefano’s words hint at.

With the obvious and clear-cut requirement that we all agree upon: a definition that requires training data to be shared, as even the “co-design” process concluded.

Training data is the most valuable asset here, just like source code was the most valuable asset when OSI was created.

Everybody knows this.

The argument “you must use your own data to rebuild it but this system is still Open Source” is just like saying “you must use your own code to rebuild it but this system is still Open Source”.

:man_shrugging: