An open call to test OpenVLA

Now that I’m a bit caught up on many of these discussions (I’ve been sitting on the sidelines, looking for signal amidst the unexpected levels of distress on display), I saw at least one good idea walk by in another thread, one I believe might help inform the discussion. I also felt the soothing calm of the Debian experience and wisdom that @hartmans ladled onto these troubled waters.

What if we bring to bear the idea that @jasonbrooks raised, which is conducting a controlled experiment on the question, “Do you need a full training dataset, or is Data Information enough to recreate a model that can be tested for fidelity to the original?”

Yes, it is just one experiment yielding one or maybe a few data points. At the least, it may indicate whether further experimentation is called for. It may also bring to light what @hartmans has raised, which is the difficulty of deciding for someone else what the preferred form of modification is.

So this week I was doing my journalistic due diligence on the new OpenVLA project.

The due diligence is for a newsletter, and the research effort is heavily informed by my work with the OSI on the OSAID and by reading these discussions. For each claim of “open” for an AI project or system, I’ve been diving as deep as I can to answer the question, “Is it Open Source AI or not?”

In good news for Jason’s experiment idea, I discovered what appears to be an Open Source AI project that has an OSI-licensed, publicly available dataset. To double-check, I created a single AI-systems review spreadsheet of the kind used by the OSI workgroup recently.

It is not complete; mainly I wanted to do the first-sheet review and see whether the project appeared to hit all the points. I may have made some mistakes in my understanding of the criteria and in my observations, but I reckon the correct answer is out there. These two projects (the model and the dataset) seem to have put tremendous effort into being open and into confirming that openness. I spot-checked a half-dozen data sources in the Open X-Embodiment Dataset.

So far, all I’m seeing is CC-BY licensing, which matches the claims in the project README file.
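
For anyone who wants to repeat or extend the spot-check, here is a minimal sketch. It assumes the data-source list has been exported to a CSV file; the filename and the column names (“dataset”, “license”) are my placeholders, not something the projects ship:

```python
import csv

# Hypothetical export of the Open X-Embodiment data-source list.
# The filename and column names ("dataset", "license") are assumptions.
SOURCES_CSV = "open_x_embodiment_sources.csv"

def spot_check_licenses(path: str) -> None:
    """Flag any data source whose license is not a CC-BY variant."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Normalize "CC BY 4.0" / "cc-by-4.0" style strings.
            license_id = row["license"].strip().upper().replace(" ", "-")
            status = "OK" if license_id.startswith("CC-BY") else "CHECK MANUALLY"
            print(f"{row['dataset']:40s} {row['license']:20s} {status}")

if __name__ == "__main__":
    spot_check_licenses(SOURCES_CSV)
```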

My conclusion is that OpenVLA is a project that meets all the OSAID requirements and optional criteria, including the debated requirement (or not) of having an OSI-licensed/OSD-compliant dataset. Please double-check my conclusions.

Presuming for now that my conclusion is accurate, how about some folks volunteer to run parallel experiments like this straw-draft:

  1. Create a test suite against the existing openvla-7b model, and verify that the test suite reliably shows whether a candidate model behaves the same as openvla-7b (see the fidelity-check sketch after this list).

  2. Set about recreating openvla-7b in two ways using identical methods, except one uses the original dataset and the other only the Data Information, along with all other available materials.

  3. Run the test suite on all resulting pairs of experimental test models for fidelity to openvla-7b.
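
To make steps 1 and 3 concrete, here is a minimal fidelity-check sketch. It assumes each model has already been run over the same held-out evaluation inputs and that its predicted actions (for OpenVLA, 7-dimensional end-effector deltas) were saved as NumPy arrays; the filenames and the tolerance are placeholders, not a methodological proposal:

```python
import numpy as np

def fidelity_report(reference: np.ndarray, candidate: np.ndarray,
                    tol: float = 0.05) -> dict:
    """Compare two models' predicted actions on the same evaluation inputs.

    Both arrays have shape (num_examples, action_dim). `tol` is a
    placeholder threshold for calling two predictions "the same" on
    a given example.
    """
    assert reference.shape == candidate.shape
    per_example_err = np.abs(reference - candidate).mean(axis=1)
    return {
        "mean_abs_error": float(per_example_err.mean()),
        "max_abs_error": float(per_example_err.max()),
        # Fraction of examples where the candidate stays within `tol`
        # of the reference model's predicted action.
        "fraction_within_tol": float((per_example_err <= tol).mean()),
    }

if __name__ == "__main__":
    # Placeholder filenames: predictions dumped by each training run.
    ref = np.load("openvla7b_actions.npy")        # original model
    full = np.load("retrain_full_dataset.npy")    # step 2, variant A
    info = np.load("retrain_data_info_only.npy")  # step 2, variant B
    print("full dataset:", fidelity_report(ref, full))
    print("data info   :", fidelity_report(ref, info))
```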

By straw-draft I mean that I am not claiming this draft stands as-is as high-quality scientific or machine-learning methodology. As a straw-draft, is it good enough for now?

Waddya think, Jason, are we in the ballpark?

3 Likes

Such an experiment is intriguing, and it would be extremely beneficial for us if someone were to conduct it.
The question of whether complete reproducibility is achievable, or to what extent reproducibility exists, has always been a challenge that has troubled me.

1 Like

This is indeed a good question. We’ve started talking to a group of CMU students who will try to run this sort of experiment. More tests like this are welcome.

Would it make sense to split this to a separate thread?

1 Like

I’m cool with that. I didn’t choose ‘sameness’ as an idea per se, and we clearly need a more precise criterion (even if it comes from an imprecise measurement).

How about “functionally similar”? Or “practically similar”?

Would it be useful to enumerate, as part of the proof, some of the values being sought from an Open Data dataset?

Does this measurement include the “model vs. model” matrix values, with the aim of keeping QA benchmark scores within a certain tolerance of the original?
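
Concretely, I’m imagining something like the sketch below; the benchmark names, scores, and tolerance are placeholders, not real OpenVLA numbers:

```python
# Hypothetical QA-benchmark comparison: keep each recreated model's
# scores within some tolerance of the original's. All names and
# numbers below are placeholders, not real results.
TOLERANCE = 0.02  # e.g., allow a 2-percentage-point absolute difference

original = {"bridge_success_rate": 0.71, "rt1_success_rate": 0.62}
recreated = {"bridge_success_rate": 0.69, "rt1_success_rate": 0.58}

for name, ref_score in original.items():
    delta = recreated[name] - ref_score
    verdict = "within tolerance" if abs(delta) <= TOLERANCE else "OUT OF TOLERANCE"
    print(f"{name}: original={ref_score:.2f} recreated={recreated[name]:.2f} "
          f"delta={delta:+.2f} -> {verdict}")
```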

What else matters in a recreated/retrained-from-source model?