Friends,
As I’ve been researching many different efforts that claim various levels of Open Source AI, I’ve noticed a pattern around the types of training datasets used for a model.
This pattern has led me to a possible solution to the debates around the Open Source AI definition’s commitment to requiring Open Data.
Summary
This proposal is for an exception in the OSAI definition that allows some or all of a training dataset to be withheld under specific conditions:
when the dataset cannot reasonably be released for reasons outside of the model creators’ control, and the model creators are otherwise acting with integrity in sourcing the data and toward Opening everything else, the release can claim/receive the “OSAI D-” designation to indicate the exception.
(I’m aware the actual terms for this would exist in licenses designed to be compliant with the OSAID.
For introducing this idea, I am using a D- designator grounded in the definition itself; these approaches can be improved, natch.)
Pattern to nuance
So far the discussion around this binary condition — with or without data — has also been binary.
My intention is to introduce a bit of nuance to show there seem to be five states rather than two — with one state existing in two quadrants because IP rights and provenance are obscured when kept closed (the Schrödinger’s cat of IP rights).
To start I’ll define the pattern I’ve observed, then diagram the pattern onto an XY quadrant, and finally integrate the proposed solution into the diagram.
The four quadrants in the diagram are M, N, O, and P, which are further defined at the end.
Observed pattern with five states
This is the core of the observed pattern, with the state number and quadrant association in brackets like this: [#,Q]
- For every potentially OSAI-compliant model, the training dataset either is released or not.
- Is Released [1,N] — those who release the training dataset know they are acting with integrity to the existing Open Source definition (OSD), as well as having the rights to license the data.
- Is Not Released — those who do not release the training dataset fall into one of two groups:
- No IP Rights [2,P] — they would release the dataset if they could, but they do not have the rights to release the data in one or more jurisdictions. They claim or appear to be acting with high integrity to overriding conditions or law, rather than selfishness.
- Have IP Rights Without Integrity [3,M] — they have the rights to release the dataset but do not choose to do so, most likely for one of these reasons:
- Commercial or research competition.
- They cannot prove the provenance of the dataset to be sure they have the rights to release it.
- Variant — they know/believe the provenance may not/does not give them the rights to release the dataset, and they wish to mitigate that risk through obscurity.
- Non-OSAI models and systems show low integrity toward the Open ecosystem overall, with two variants to fit into quadrants:
- No or Unclear IP Rights — [4,O] quadrant for fully-closed.
- Full IP Rights — [5,M] quadrant, choosing not to Openly license the software.
Here is the base diagram for these quadrant definitions:
(Diagram: Organization relationship to data)
Here is the full diagram with the proposed D-/D+ solution mapped to the quadrants:
Quadrants defined
Restated, the diagram means:
The “M” quadrant is where the model creators have the rights to license the software and full training data, but choose not to do so.
They may call properly licensed software and content “Open Source”, but their AI is closed and cannot be referred to as “Open Source AI”.
The “N” quadrant is where the model creators have the rights and integrity to release the full training data and all other assets. With a proper license, they are OSAI D+ compliant.
The “O” quadrant is where the model creators have no rights or choose not to license the software or data, and this is the realm of all closed source AI.
The systems without proper licensing that misuse the term “Open Source” also have the option of properly licensing their software with an Open Source license (and thus moving to quadrant M).
The “P” quadrant is where the model creators would release the data if they had the rights to it, but for (verifiably?) legitimate reasons cannot.
If they are acting with high integrity toward the spirit and meaning of Open, and also properly license all assets except the training dataset, they are OSAI D- compliant.
NB: This diagram may be better as a matrix rather than a quadrant with XY ranges.
The ranges do not make direct sense where the conditions are binary.
However, I have an instinct there may be further nuance that creates some level of range along the X or Y axis, and the quadrant layout spurs that thinking.
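To make the quadrant logic concrete before applying it, here is a rough sketch of how the states and designations could be encoded. The field names (has_ip_rights, releases_dataset, openly_licensed, acts_with_integrity) are my own shorthand for this illustration, not terms from the OSAID, and the conditions are a simplification of the pattern above.

```python
from dataclasses import dataclass
from enum import Enum


class Quadrant(Enum):
    M = "rights held, chooses closed"
    N = "rights held, fully released (OSAI D+)"
    O = "no or unclear rights, fully closed"
    P = "no rights to the data, otherwise Open (OSAI D-)"


@dataclass
class Release:
    has_ip_rights: bool        # creators hold the rights to license the training data
    releases_dataset: bool     # the training dataset is published under an Open license
    openly_licensed: bool      # weights, code, and other assets carry proper Open licenses
    acts_with_integrity: bool  # sourcing and any withholding reasons are legitimate/verifiable


def classify(r: Release) -> Quadrant:
    """Map a release onto the M/N/O/P quadrants described above."""
    if r.has_ip_rights and r.releases_dataset and r.openly_licensed:
        return Quadrant.N   # state [1,N]: OSAI D+
    if not r.has_ip_rights and r.openly_licensed and r.acts_with_integrity:
        return Quadrant.P   # state [2,P]: eligible for the OSAI D- exception
    if r.has_ip_rights and not r.releases_dataset:
        return Quadrant.M   # states [3,M] and [5,M]: closed by choice
    return Quadrant.O       # state [4,O]: fully closed


# Example: a case where the data cannot be released but everything else is Open.
example = Release(has_ip_rights=False, releases_dataset=False,
                  openly_licensed=True, acts_with_integrity=True)
print(classify(example))  # Quadrant.P
```

Of course, a mapping like this is only as trustworthy as its inputs; the obscured-provenance variant in quadrant M is exactly the case where the inputs cannot be verified from the outside (the Schrödinger’s cat again).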
Applying the proposed model
Two examples that can map into this proposed model are covered in detail in my forum post, An open call to test OpenVLA.
The Open X-Embodiment Dataset clearly fits into the N quadrant, with a properly licensed dataset.
For the OpenVLA model, there is an argument that the creators are acting with high integrity but are unable to release the dataset.
Given the diligence put into the rest of the Opening work, it seems the model creators would release the dataset if they could.
It is eligible for the “D-” exception in the P quadrant.
It is arguably a great benefit to society for OpenVLA to advance via the effects of fully-Open Source AI development — regardless of access to the dataset, an Open Source ecosystem around this model could thrive.
In closing, a comparison
Would it be fair to compare the condition of “reasonable inability to release the dataset” to having an Open Source license for code that can only run on an IBM zSeries mainframe?
Nothing is certain, but isn’t there potential technical and cultural value to having that code as Open Source?
This proposal and the diagrams in editable SVG format are in this repo: