Can a derivative of non-open-source AI be considered Open Source AI?

I'll share my experience fine-tuning a “small” model (200M params) to build a feature in Firefox that describes images (Experimenting with local alt text generation in Firefox Nightly - Mozilla Hacks - the Web developer blog)

For image-to-text models like that, it’s standard practice to start from pretrained base models for both the image encoder and the text decoder and fine-tune them, rather than training from scratch.

I picked:

  • a ViT model for the image encoder: google/vit-base-patch16-224-in21k
  • a distilled GPT-2 model for the text decoder: distilbert/distilgpt2
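For readers curious how two such checkpoints get glued together, this is a minimal sketch using Hugging Face transformers’ VisionEncoderDecoderModel. The checkpoint ids are the ones named above; everything else (the helper name, the pad-token setup) is illustrative, not necessarily how distilvit does it:

```python
# The two pretrained checkpoints named in the post.
CHECKPOINTS = {
    "encoder": "google/vit-base-patch16-224-in21k",  # ViT image encoder
    "decoder": "distilbert/distilgpt2",              # distilled GPT-2 text decoder
}

def build_captioning_model():
    """Compose the encoder and decoder into one fine-tunable model.

    Imports are local because calling this downloads the weights.
    """
    from transformers import (
        AutoTokenizer,
        ViTImageProcessor,
        VisionEncoderDecoderModel,
    )

    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        CHECKPOINTS["encoder"], CHECKPOINTS["decoder"]
    )
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINTS["decoder"])
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships with no pad token
    model.config.decoder_start_token_id = tokenizer.bos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    image_processor = ViTImageProcessor.from_pretrained(CHECKPOINTS["encoder"])
    return model, tokenizer, image_processor
```

From there, fine-tuning is ordinary supervised training: encode an image, teach the decoder to produce the reference caption.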

One issue we hit when training on COCO and Flickr30k was annotator bias.
I worked on de-biasing the dataset (see distilvit/docs/ at main · mozilla/distilvit · GitHub) and retrained the model.

For example, one change was to keep descriptions gender-neutral. That new round of fine-tuning was very effective: with only 30k labels, we have yet to find a single gendered description, and we manually tested thousands of images, including adversarial ones.
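To make the gender-neutral idea concrete, here is a toy sketch of one such rewriting pass over captions. The substitution table and function are hypothetical, for illustration only; the actual distilvit de-biasing is documented in the repo linked above:

```python
import re

# Hypothetical mapping from gendered nouns to neutral equivalents.
NEUTRAL = {
    "man": "person", "woman": "person",
    "men": "people", "women": "people",
    "boy": "child", "girl": "child",
    "boys": "children", "girls": "children",
}

def neutralize(caption: str) -> str:
    """Rewrite gendered words in a caption, preserving initial capitalization."""
    def swap(match):
        word = match.group(0)
        repl = NEUTRAL[word.lower()]
        return repl.capitalize() if word[0].isupper() else repl

    # \b anchors keep e.g. "humanity" or "German" untouched.
    pattern = r"\b(" + "|".join(NEUTRAL) + r")\b"
    return re.sub(pattern, swap, caption, flags=re.IGNORECASE)
```

Running every training label through a pass like this (plus manual review) is what lets a small fine-tuning set steer the model's output so consistently.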

What I am getting at is this: if a project is very intentional in how it fine-tunes base models, even without 100% traceability of the data used to train those base models, and if the whole process is transparent and published as open source, then I would consider the new fine-tuned model open source, and anything problematic that surfaces from that upstream data a fixable issue for the project maintainers.