Open Weights or Open Source AI?

Sharing my latest post on the topic given Meta’s “Open Source AI” announcements today, along with my proposal for a path forward. I understand the likes of @Shamar and @thesteve0 have received bans for straying from the approved talking points, but this post complies with the community guidelines and is entirely relevant to recent discussions that are already public; given that it refers to this community, it would have been inappropriate not to share it here. — Sam

Bigger is Better?

During his keynote at Meta Connect 2024 (the relevant parts of which I encourage you to watch in 10 minutes, or 5 minutes at 2x; I’ve transcribed it to subtitles so you can), Mark Zuckerberg announced some really exciting developments in AI, both in products (glasses, VR headsets, voice and video capabilities, etc.) and in the release of Llama 3.2: Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. Absent from this announcement were the usual larger models with more parameters learned from more training tokens, suggesting we may have found a limit of size versus performance/efficiency.

The introduction of multi-modal capabilities with small and medium-sized vision LLMs (11B and 90B) is going to spark the development of some really interesting applications, but it’s the lightweight text-only models (1B and 3B) that will bring AI to devices that have so far been out of its reach. And not necessarily devices with dedicated hardware either; phones are obvious candidates, but it won’t be long before we’re discussing dinner options with a Thermomix.

Not afraid to ruffle feathers though, Mark doubled down on the contentious “Open Source AI” branding used to describe Llama models in the past. Just last month Meta was accused of “bullying” the open-source community, the Open Source Initiative (OSI) having already stated in July that Meta’s LLaMa 2 license is not Open Source, based on applying the decades-old Open Source Definition to it.

This branding is no accident, and in the absence of an agreed “Open Source AI” definition, there’s little the Free and Open Source Software (FOSS) community can do about it. I have no doubt that it resonates well in the business communities Meta are targeting, and it may even have a moderate impact on their main consumer market, but it does make life difficult for actual Open Source projects like our own Personal Artificial Intelligence Operating System (pAI-OS), for which the “Linux of AI” moniker they also claimed would be far more fitting (indeed, we aim to become the Linux of Personal AI).

Apple’s Intelligence

You’d have to be living under a rock to have missed the launch of Apple Intelligence, even if only via its dependency on the new iPhone 16 (and, happily, my iPhone 15 Pro and M-series iPads and MacBook Pros). That’s because they’ve managed to pack a lot of punch into the ~3 billion parameter on-device models, which require the next generation of hardware to operate: Introducing Apple’s On-Device and Server Foundation Models.

If your task is too big for the device, it gets shipped off to their new Private Cloud Compute service (yes, I had to check we hadn’t travelled back to 2007 for the early cloud computing discussions!): Private Cloud Compute: A new frontier for AI privacy in the cloud. Too big for that and they’ve done a deal with OpenAI to access some of the biggest and best models available today.

While it didn’t ship with iOS 18.0, it’s coming with 18.1, and I’ve already had a few weeks testing it via Apple’s Developer Beta. We’re sworn to secrecy, but suffice it to say I’ve been impressed with how non-invasively it does things like summarising notification groups and suggesting responses. My favourite feature is a new focus setting that uses AI to reduce interruptions.

None of this is open though, nor does it need or claim to be: to use it you’ll need recent Apple hardware, which is where Llama differs, in that it will run almost anywhere given the requisite resources. It comes with an enviable amount of vendor lock-in that other OEMs have relinquished to the likes of Google and Meta, but for many Apple users, including this one, it’s more love-in than lock-in (as a Google colleague used to say).

Open is Closed

So far, the more “Open” appears in an artificial intelligence product’s branding, the less open it actually tends to be. OpenAI is anything but open, with its models staying server-side like the secret recipe for Coca Cola. And that’s fine, because nobody says that everybody has to be open. Indeed, most software products and services aren’t! But if you want to be, you should say what you do in terms of openness, and do what you say. I understand they’re changing the logo, but it’s a shame it’s likely too late to revisit the name.

The only significant exception to the closed nature of OpenAI’s models is the directly accessible deployments on Microsoft Azure, which have been a boon for both companies. If not for the board’s firing of Sam Altman from his own company on 17 November 2023, which sent shockwaves throughout the industry, I’d have wondered whether this wasn’t seen internally as an early strategic error they would want to correct. Thanks to that event though, customers can feel safe in the knowledge that it’s now burnt into Microsoft’s DNA and isn’t going anywhere; Satya Nadella will have made sure of it in the aftermath. Sam was re-hired not long after and will likely soon receive well-earned equity in the transition from non-profit to for-profit (though I see several other executives have departed in recent weeks, three today including their CTO).

OpenAI’s OpenAPI spec for the OpenAI API — say that three times quickly! — is also becoming a de facto standard, for better or worse. It’s licensed under a permissive Open Source license, which is a solid start, but the governance process is closed.
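One measure of that de facto status is that most open-model servers now expose OpenAI-compatible endpoints, so the official client can be pointed at them just by changing the base URL. A minimal sketch, assuming a hypothetical local server and model name (neither comes from the spec itself):

```python
from openai import OpenAI

# The official OpenAI client talking to a hypothetical OpenAI-compatible
# server: only the base URL and model name change, the API surface doesn't.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-3.2-3b-instruct",  # whatever the local server calls its model
    messages=[{"role": "user", "content": "Who governs this API spec?"}],
)
print(response.choices[0].message.content)
```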

Open Weights

Having set the scene for some of the main types of AI we’re seeing in the wild today: those delivered as cloud services over the internet from huge server farms (OpenAI), those black boxes that run on your device but with which you can only interact through well-defined APIs (Apple), and those you can integrate with your own products and services (Meta’s Llama), let’s take a closer look at their openness.

In the first instance (OpenAI), all bets are off because you can’t even access the model weights; they’re sitting on servers like famous recipes in safes. In the second (Apple), to the extent that obfuscation and even encryption don’t prevent you from repurposing the models, you’d soon run into legal and copyright issues if you tried to do so, at least in a commercial context; better to stick with the APIs they make available to you. And in the third (Meta’s Llama), you can access the weights in the sense that you can download and run the models in whatever context you like, subject to “safety” limitations burnt into the models as well as the licenses they are made available to you under, but that’s about it.

If they would just brand their licensing as the self-describing “Open Weights” then there wouldn’t be another word said about it, but…

Mr Zuckerberg’s eagerness to shape what is meant by open-source AI is understandable. Llama sets itself apart from proprietary LLMs produced by the likes of OpenAI and Google on the openness of its architecture, rather as Apple, the iPhone-maker, uses privacy as a selling point. — The Economist

Four Freedoms

Of the four freedoms approved Open Source licenses set out to protect — see my article on The Open Source(ish) AI Definition (OSAID) for more — the only ones effectively extended to you by Meta with Llama are the limited freedom to Use their software, subject to their Acceptable Use Policy, and to Share it, again subject to restrictions. Want to build that disruptive new robo-advisor startup? Nope. Exercise your rights under the Second Amendment? That’s out too. Need to educate your employees on the security risks of spear phishing? Go directly to jail. Want to share it? Best make sure you’re following every law everywhere.

That’s assuming you’re any more willing to use a black box with unknown and unverifiable contents than you are to consume food without knowing its ingredients. We do know they’re training on public social media posts, at least down under, with or without explicit consent (they’re all doing it, by the way), but we don’t know much more than that. That’s why Meta won’t release its multimodal Llama AI model in the EU any time soon. This is arguably a good thing, as social media is the modern-day town square, but distribution of training data is fraught with copyright concerns despite it being “public”.

Llama is also subject to the terms of the Llama 3.2 Community License Agreement, which range from manageable-but-problematic, like having to display “Built with Llama”, to catastrophic for certain fields of endeavour: the license self-destructs once you have a certain (large) number of monthly active users (700 million), presumably to prevent Meta’s competitors from using it. It’s no wonder then that the Open Source Initiative (OSI) stood their ground in relation to their well-defined turf: the Open Source Definition.

The thing is, Meta are well within their rights to impose these conditions, and you’re going to find similarly onerous terms in the End User License Agreement (EULA) of your favourite software. And you’re still going to use it, because it’s like advanced alien technology you can bring in-house that is unmatched in the market today.

Nobody is telling Meta they have to be Open Source, but if they want to then they should follow the well-established rules. But what are the rules, beyond the OSD which only applies to the license itself?

Open Source AI

The tech industry can’t agree on what open-source AI means. That’s a problem. Open Source aficionados, including myself, are yet to agree on what the rules should be to protect the four freedoms for Artificial Intelligence (the other two being the rights to Study and Modify the software) in the same way that the Open Source Definition (OSD) does for software. Attempts to find consensus have thus far failed, resulting in a contentious draft of The Open Source(ish) AI Definition (OSAID) that we fear could be rammed through as a release candidate as soon as today at Nerdearla.

Despite breathless claims that we finally have a definition for open-source AI, we do not. Just today I discovered that the votes held as part of the “co-design” process to determine which requirements make it into the definition (conspicuous in its absence: the training data itself, without which models cannot be studied and modified without limitation) don’t even reflect the views of members of the working groups, let alone the wider community.

Indeed, one such working group (examining Meta’s Llama, no less) was even granted the superpower to nullify other working groups’ votes, an ability that was unsurprisingly exercised across the board in the data category by a Meta lawyer, one of two Meta employees invited to participate on the basis that they know their system better than those tasked with regulating it!

No surprise then that on eliminating these undocumented negative votes — a departure from democratic norms if there ever was one (assuming you even consider democracy a valid tool for defining technical standards) — the same methodology used to exclude training data sets now demands they be provided, thus giving us a path forward (which we can relax in future revisions if required):

We need to get the Open Source AI Definition (OSAID) done, but more than that, we need to get it right. To quote another open source old guard’s public appeal to the OSI board: “we can always make the definition more permissive if we discover that we have been too ambitious with a definition, but that it is functionally impossible for us to be more ambitious later.”

On the other hand, if we were to get it wrong…

Imprecision could lead to “open-washing”, says Mark Surman, head of the Mozilla Foundation. In contrast, a watertight definition would give developers confidence that they can use, copy and modify open-source models like Llama without being “at the whim” of Mr Zuckerberg’s goodwill. Which raises the tantalising question: will Zuck ever have the pluck to bare it all? — The Economist


While I haven’t sold my soul to Apple’s gadgets yet (:crazy_face:), I’ve found your analysis quite good.

However I have to suggest a correction… to the title. :slight_smile:

Fine, Llama’s weights are… “weights”.

But how are they “open”?

Maybe you meant “Freeware weights”?
Because you can download and use them for free, but even ignoring the license, they are literally black boxes.


PS:

Actually, the public training data used by Meta are not only subject to copyright law, but also to personal data protection laws such as the GDPR.

Yet assuming that you have the right to use such data for training an LLM, but not to redistribute them, nothing prevents you from distributing a dataset manifest with the URL, timestamp and sha512sum of each public record you used.
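Here, for instance, is a minimal sketch of generating such a manifest (the record URLs and file names are hypothetical placeholders):

```python
import hashlib
import json
import urllib.request
from datetime import datetime, timezone

def manifest_entry(url: str) -> dict:
    """Fetch a public record and describe it by URL, timestamp and sha512sum."""
    with urllib.request.urlopen(url) as response:
        content = response.read()
    return {
        "url": url,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "sha512": hashlib.sha512(content).hexdigest(),
    }

# Hypothetical public records used in training; a real list would be enormous.
urls = [
    "https://example.com/public/record-1",
    "https://example.com/public/record-2",
]
with open("training-data-manifest.jsonl", "w") as out:
    for url in urls:
        out.write(json.dumps(manifest_entry(url)) + "\n")
```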

So there is no real constraint, neither legal nor technical, that prevents Meta from granting the right to study and to modify Llama models.

Except, of course, Zuckerberg will. :wink:


@samj -san,
I thought this topic should be in a separate thread, so I’m glad you created one.

I believe the draft for RC1 might not be ready yet. The D+/D- discussions need to be reflected after version 0.0.9. That alone seems like a challenging task, and considering the discussions from the past few days, the release of RC1 still seems premature. Additionally, it took three months between the 0.0.8 and 0.0.9 releases, which was already several months behind the original schedule. Given the delays, I don’t personally feel there is any particular reason to insist on All Things Open as the goal for the 1.0 release. (As a non-native English speaker, I face the challenge of sleep deprivation from keeping up with everyone’s discussions.)

As for your point about the voting in the working group, I feel it is necessary to verify whether it was fair. If your concerns are valid, even if we manage to create the correct OSAID, more people will lose trust in OSI, which is not a good outcome.

However, I don’t think there is any need to halt the current development process. As Zack-san expressed in another topic, I also believe we can improve the current 0.0.9 to a point where most people will be satisfied.

Currently, Japan’s National Institute of Informatics (NII) is continuing development with the goal of making GPT-3-class large language models reproducible, even by ordinary companies, while ensuring that all necessary data, code, and developed models are published under Open Source compliant licenses. Several models ranging from 1.3B to 172B have already been released, and all the datasets used are publicly available on the following site. The licenses are CC BY and ODC-BY.
https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3

If CC BY and ODC-BY are considered OSD-compliant, then all components developed by this organization are open source. Thus, all AI systems developed by this organization meet the conditions required by OSAID 0.0.9. In other words, under the current OSAID, these systems would be considered Open Source AI.

However, they have not declared their development results as Open Source in any context. I believe this is because they are aware of the risk that their work could unintentionally cease to be Open Source. If OSAID ultimately requires the completeness of datasets, then if any part of a dataset is lost for any reason, their AI system could no longer be called Open Source AI. In that case, the safest solution is not to claim to be Open Source from the beginning. Yes, there are cases that are the exact opposite of companies like Meta, who falsely claim to be Open Source.

I want them to be able to proudly declare their work as Open Source AI, so I believe we need to allow for at least that level of imperfection. This is not to say that the current 0.0.9 wording should remain unchanged.


The problem with OSAID 0.0.9 is not that it doesn’t match a truly open AI system, but that it can be satisfied by many black boxes too.

Luckily we could let them declare their work as Open Source AI, without compromising on its quality!

From what you’ve said, NII’s LLM matches even the alternative definition that I proposed days ago:

The preferred form of making modifications to a machine-learning system is:

  • Source Data: All the data used to train the system must be available for use by the public either under licenses that comply with the Open Source Definition or under the same legal terms that allowed their use to train the system in the first place, so that a skilled person can recreate an exact copy of the system using the same data.

    • For example, if used, this would include the training data sets used and the checksums that guarantee their integrity, the values obtained from random sources during the training process, and any other data required to compute the weights distributed with the system.
      If any part of the source data cannot be distributed by the developers, it can be referenced with a URL, timestamp and hash, as long as it stays available to the public under the same terms that allowed its use in training.
  • Processing Information: Sufficiently detailed information about the process used to train the system from the source data, so that a skilled person can recreate an exact copy of the system using the same data. Processing information shall be made available with licenses that comply with the Open Source Definition.

    • For example, if used, this would include the training methodologies and techniques, how to retrieve the source data and check their integrity, the labeling procedures and data cleaning methodologies, and any other information required to compute the weights distributed with the system from the source data.
  • Source Code: The source code used to train and run the system, made available with licenses that comply with the Open Source Definition.

    • For example, if used, this would include code used for pre-processing data, code used for training, validation and testing, supporting libraries like tokenizers and hyperparameters search code, inference code, and model architecture.
  • Weights: The model weights and parameters, made available with licenses that comply with the Open Source Definition.

    • For example, this might include checkpoints from key intermediate stages of training as well as the final optimizer state.

So NII’s LLM proves that such a definition (maybe with better wording? I’m a non-native English speaker too! :man_shrugging:) would be viable, as it is ready by 24 October, and that it would not define an empty set.
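And the integrity check the Source Data clause implies is equally simple; a minimal sketch, reading the same hypothetical manifest format sketched above:

```python
import hashlib
import json
import urllib.request

def verify_entry(entry: dict) -> bool:
    """Re-fetch a referenced record and check it still matches its sha512sum."""
    try:
        with urllib.request.urlopen(entry["url"]) as response:
            content = response.read()
    except OSError:  # unreachable or removed record
        return False
    return hashlib.sha512(content).hexdigest() == entry["sha512"]

# Check every record referenced by the manifest distributed with the model;
# any mismatch or missing record means the source data is no longer intact.
with open("training-data-manifest.jsonl") as f:
    entries = [json.loads(line) for line in f]

failed = [e["url"] for e in entries if not verify_entry(e)]
print(f"{len(entries) - len(failed)}/{len(entries)} records verified")
```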

It’s not our job to regulate the term “Open” (or “Freeware” for that matter, to the extent anyone uses the term or knows what it entails today), but the OSI has successfully established itself as the arbiter of the term “Open Source” over the past quarter century, at least within our industry. As OSI co-founder Bruce Perens said recently, “The common person doesn’t know about Open Source, they don’t know about the freedoms we promote which are increasingly in their interest.”

That we’ve done so even without the government-enforced monopoly of a trademark for the term (which is merely descriptive, as Software in the Public Interest apparently discovered around the time of the launch) is an impressive feat. It speaks to the quality of the Open Source Definition and its ability to walk the tightrope between Free software and the needs of businesses.

Our success is also in no small part due to the deference of behemoths in the ecosystem and their unwillingness to blatantly attempt to co-opt the term, at least until now; you don’t see Microsoft claiming Windows is Open Source. Indeed, initially they were actively hostile to it, which gave us some breathing room (the other behemoths did not exist yet). That there is evidence of corporate capture of the “co-design” process today should give us pause, whether there is an actual conflict of interest (as appears to be the case) or merely the appearance of one.

The objective of this exercise is to replicate that early success today for AI, and while the discussion has been difficult at times, the strongest steel is forged in the hottest fire. It’s been nearly 20 years since I moved to France to start beating the drum of cloud computing, and yet we still haven’t addressed the transition from products to services that undermines our licenses. It hasn’t even been two years since OpenAI captured the public’s imagination with the November 2022 launch of ChatGPT, so we can take a little more time rather than rush it out today — which for all we know is still the plan.

And yes, there are myriad challenges in distributing training data sets, but addressing them rather than brushing them under the rug is well worth the effort.

Agreed 100%, but if something absolutely has to go out the door due to contracts/budgets/events/etc., we could easily add training/testing data, documentation, papers, etc. to a release candidate and loosen it up later (the other way doesn’t work for obvious reasons). It would have been good to have had a counterpoint to yesterday’s keynote too, not to mention today’s Hugging Face milestone of “1,000,000 free public models” in an hour.

I had conflicts today so missed @Mer’s live presentation, but I saw on X that we were asking for endorsers, which isn’t a great sign. Let’s see what @stefano has to say at the town hall.

You know I believe there never should have been a vote, and that we must devise a functional litmus test like the OSD — perhaps one that can be self-determined with exceptions investigated so it can scale (a million models is a lot of models and they can’t just pick a pre-approved license!). Did any practitioners actually demonstrate they could exercise any or all of the four freedoms with any candidate models? That’s probably a good place to start, and it would certainly be more useful than vanity metrics like endorsers and votes.

The vote did happen though, and faulty data was relied upon in isolation for the single most important decision of the design: what, if anything, could be dropped from the definition and still leave it functional. I’m going to dive deeper into this in my own talk tomorrow, having since had a closer look at other statistics like the vote distribution.

Agreed, we don’t need to toss out any work done to date, and interpreting the voting data without manipulation still gives a useful starting point for a release candidate that should satisfy most people (we don’t need to, and never will, achieve unanimity).

This is an exciting example of the sort of project we should be protecting with a meaningful Open Source AI Definition, one that allows them to differentiate not only from service-only offerings like OpenAI, but also from black-box products like Llama.

We could lose ourselves in licensing here, so let’s not, but “Common Crawl maintains a free, open repository of web crawl data that can be used by anyone”, subject to their Llama-like Terms of Use. You might think of this as a toxic waste dump of data from a copy-and-other-rights perspective, but the reality is that it works for very many models and it’s even hosted on Open Data on AWS (to the point about hosting challenges and data sets disappearing, which would rightly revoke Open Source AI status).

Let’s say I have Llama’s black box in one hand and a model trained on CC in the other: I can clearly Use, Study, Modify, and Share the latter but not the former. For example, I could study the CC sources to determine what classes of NSFW content were present, then filter them out and re-train (not just fine-tune) the model to share it as an LLM for libraries. You quickly end up in a discussion about fair use negating copyright claims and so on, but the 501(c)3 non-profit CC Foundation runs a notification procedure to handle such claims on your behalf before you even see the data anyway.
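As an illustration of that study-then-filter step, here’s a minimal sketch using the warcio library to scan a Common Crawl WET (extracted text) file; the term list is a toy stand-in for a real content classifier, and the file name is hypothetical:

```python
from warcio.archiveiterator import ArchiveIterator

# Toy stand-in for a real trained content classifier.
NSFW_TERMS = {"example-term-1", "example-term-2"}

def is_nsfw(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in NSFW_TERMS)

kept = dropped = 0
# WET files hold the extracted plain text of each crawled page ("conversion" records).
with open("CC-MAIN-example.warc.wet.gz", "rb") as stream, \
        open("filtered-corpus.txt", "w", encoding="utf-8") as out:
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":
            continue
        text = record.content_stream().read().decode("utf-8", errors="replace")
        if is_nsfw(text):
            dropped += 1
        else:
            out.write(text + "\n")
            kept += 1

print(f"kept {kept} records, dropped {dropped}")
```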

I’m not saying we should allow data made available under terms rather than licenses like CC-BY, but we could, and still protect the four freedoms. If we decide that training data is required — which I believe is self-evident and in any case the will of the majority — then we have options other than just declaring it to be too hard.


Totally agree.

Now, let’s suppose we were going to follow this path: how would you patch the existing definition to add such requirements?

I proposed a solution in this direction, but I’m more than happy to change it as long as all four freedoms remain granted to all users equally.

What about:

Source Data: All the data used to train the system shall be available for use by the public under licenses that comply with the Open Source Definition, so that a skilled person can recreate an exact copy of the system using the same data.
Whenever a part of such data is available to the public under the same legal terms that allowed their use to train the system but cannot be distributed by the developers, the Processing Information shall include details about how to obtain such data and verify their integrity and completeness.
