OpenAI CLIP Is Like Human Computer Eyes
I've been playing with some GPT-3 powered products, and was recently granted direct access to OpenAI beta APIs. I'm blown away by how quickly artificial general intelligence is progressing.
Along the way, I stumbled across the January 2021 introduction of CLIP (Contrastive Language–Image Pre-training) which "efficiently learns visual concepts from natural language supervision."
What the heck does that even mean?
Computer vision - the process of showing a computer something and having it tell you what it sees - has benefited from rapid progress in deep learning methods. That progress, however, has come at a cost.
The more layers "deep" you want to go - that is, the more specific the trained model needs to be - the more input data you need. For example, training a model to extract features of shoes, like high heels and stilettos, requires lots of images of shoes to get started. Shoes of all types. Not just the shoes - or the features of the shoes - you know you'd like the model to identify.
These input images need to be described in detail to the computer. That's where - over recent years - humans have been involved. Products like Appen (which acquired Figure Eight, itself formerly known as CrowdFlower) have provided access to humans to annotate images for model training purposes.
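To make that concrete, here's a minimal sketch of what "described in detail to the computer" typically looks like: every image lives in a folder named after its human-assigned label, and a classifier is trained to predict that label. The shoes/ folder layout, the ResNet backbone, and the hyperparameters are illustrative assumptions, not any specific real pipeline.

```python
# Minimal sketch of traditional supervised training: every image needs a human-assigned label.
# Assumes an illustrative folder layout like shoes/high_heel/001.jpg, shoes/stiletto/002.jpg, ...
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# ImageFolder derives each image's label from its parent directory name,
# i.e. the "description" a human had to supply up front.
dataset = datasets.ImageFolder("shoes/", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, len(dataset.classes))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:          # one pass over the labelled data
    logits = model(images)
    loss = loss_fn(logits, labels)     # supervision comes entirely from the human labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The point isn't the architecture; it's that every single training example in this setup required a person to put the image in the right bucket first.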
Appen is, as of today, a $2 billion market cap company. They solve an expensive problem. In OpenAI's own words:
"The ImageNet dataset, one of the largest efforts in this space, required over 25,000 workers to annotate 14 million images for 22,000 object categories."
That's a lot of humans, a lot of images, and a lot of typing.
CLIP sets out to provide a cheaper alternative, leaning on "text–image pairs that are already publicly available on the internet" to remove the need for human-annotated, manually labelled datasets. It pairs semi-structured text from the web with its accompanying image to drive the pre-training task.
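Under the hood, the pre-training objective is contrastive: given a batch of scraped (image, caption) pairs, the model learns to score each image highest against its own caption. The CLIP paper describes this with a short pseudocode snippet; the sketch below is a PyTorch-flavoured paraphrase of that symmetric objective, with the encoders, batch size, and fixed temperature simplified for illustration.

```python
# Sketch of CLIP's contrastive objective: in a batch of N (image, text) pairs,
# the N matching pairs are pulled together and the N*N - N mismatches pushed apart.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot product becomes cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # N x N similarity matrix; entry (i, j) scores image i against caption j.
    logits = image_emb @ text_emb.t() / temperature

    # The correct caption for image i is caption i (the pair scraped together from the web).
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Symmetric cross-entropy: pick the right caption per image
    # and the right image per caption.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2

# Toy usage with random tensors standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

In the real model the temperature is a learned parameter and the embeddings come from an image encoder and a text encoder trained jointly, but the shape of the objective is the same: no human labels, just the pairing that already exists on the web.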
It's a more efficient approach because - in theory - it's only limited by the amount of content available on the internet. That's the potential scale of the dataset. Previous datasets - like ImageNet - were primarily built from similarly structured but limited sources, like Flickr. That's just the input.
The output will only be as general as the input. Again, limited only by the amount of content on the internet, which grows by the day, and whose mix of content shifts and evolves over time. As the internet grows, so can CLIP's level of visual understanding - each time the model is retrained on fresher data.
CLIP does, obviously, have limitations. One example of a sticker attached to a Granny Smith apple highlighted some neuron nastiness. The sticker said "iPod," so the model classified the apple as an iPod. Then again, humans aren't perfect at classifying things either. We're fooled, too.
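You can reproduce the spirit of that test with OpenAI's open-source clip package: score an image against a couple of candidate captions and see which one wins. The image filename and prompt wording below are illustrative assumptions.

```python
# Zero-shot check in the spirit of the "iPod sticker" example, using openai/CLIP
# (pip install git+https://github.com/openai/CLIP.git). The filename is hypothetical.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("granny_smith_with_sticker.jpg")).unsqueeze(0).to(device)
prompts = ["a photo of a Granny Smith apple", "a photo of an iPod"]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)       # similarity of the image to each prompt
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

for prompt, p in zip(prompts, probs.tolist()):
    print(f"{prompt}: {p:.2%}")
# A handwritten "iPod" sticker can push the second prompt's score above the first.
```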
Either way, exciting times for all in the world of computer science. Read more about CLIP here.