CLIP: Learning Transferable Visual Models From Natural Language Supervision (2021)

Bridging the Gap Between Vision and Language — A Look at OpenAI’s CLIP Model

Naoki
9 min read · Aug 13, 2023


Deep learning vision models have traditionally relied on vast collections of labeled images, each tailored to recognize objects in a specific category or class. OpenAI’s approach, which learns from images paired with natural language, offers an alternative that doesn’t require such tailored examples. Their CLIP model can recognize objects without needing an individual training set for each new object. Moreover, generative models such as OpenAI’s DALL·E and Stability AI’s Stable Diffusion use CLIP to encode input text for text-to-image generation, showcasing the power of combining natural language processing with computer vision.
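To make that capability concrete, here is a minimal zero-shot classification sketch using the Hugging Face transformers implementation of CLIP. The checkpoint name, candidate labels, and image path are illustrative assumptions, not details from CLIP itself; the point is simply that the candidate classes are expressed as free-form text rather than baked into a trained classifier head.

```python
# Zero-shot classification sketch with CLIP (Hugging Face transformers).
# The checkpoint, labels, and image path below are illustrative assumptions.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # any image you want to classify
labels = ["a photo of a cat", "a photo of a dog", "a photo of a horse"]

# Encode the image and the candidate label texts together.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Swapping in a different set of labels requires no retraining, which is exactly the flexibility the traditional approach below lacks.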

This article will examine how CLIP works and what it brings to computer vision.

Teaching Computers to Recognize Objects

The Traditional Approach

Traditionally, when we teach computers to recognize objects such as cats, dogs, or horses, we give them a massive collection of images with labels specifying what’s in each image. This method works well for those specified categories, but what happens when we want…
