ViT: Vision Transformer (2020)

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Naoki · Nov 2, 2022

In 2020, the Google Brain team developed Vision Transformer (ViT), an image classification model without a CNN (convolutional neural network). ViT directly applies a Transformer Encoder to sequences of image patches for classification.

This article explains how ViT works.

Vision Transformer Architecture

Vision Transformer Overview

The idea is simple: ViT splits an image into a sequence of patch embeddings, adds positional encodings, and feeds the sequence into a Transformer Encoder. ViT has a classification head (MLP, a multi-layer perceptron) that produces the final prediction. The figure below shows an overview of the Vision Transformer architecture.

Figure 1 of the paper
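To make the data flow concrete, here is a minimal PyTorch sketch of that pipeline. The class name, dimensions, and the single linear classification head are illustrative assumptions (the paper uses an MLP head with one hidden layer during pre-training), not the exact training setup:

```python
import torch
import torch.nn as nn

class ViTSketch(nn.Module):
    """Illustrative ViT-style classifier: patch embedding + [class] token
    + learned positional embeddings + Transformer Encoder + linear head."""
    def __init__(self, image_size=224, patch_size=16, in_chans=3,
                 dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding as a strided convolution, which is equivalent to
        # flattening each P x P x C patch and applying a shared linear layer.
        self.patch_embed = nn.Conv2d(in_chans, dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=dim * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)  # classification head

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.patch_embed(x)              # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)     # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend the [class] token
        x = x + self.pos_embed               # add positional embeddings
        x = self.encoder(x)
        return self.head(x[:, 0])            # classify from the [class] token

logits = ViTSketch()(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```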

More concretely, ViT reshapes an image of shape H x W x C (height, width, channels) into a sequence of N flattened patches of shape N x (P² C), where P is the patch size and N = HW / P² is the number of patches.
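As a quick sketch of that reshape (using the paper's default sizes H = W = 224, P = 16; the exact tensor manipulation here is my assumption, one of several equivalent ways to patchify in PyTorch):

```python
import torch

B, H, W, C, P = 1, 224, 224, 3, 16        # illustrative sizes
N = (H * W) // P**2                        # number of patches: 196

img = torch.randn(B, C, H, W)
# Cut the image into P x P patches, then flatten each to a P²·C vector.
patches = img.unfold(2, P, P).unfold(3, P, P)   # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)     # (B, H/P, W/P, C, P, P)
patches = patches.reshape(B, N, P * P * C)      # (B, N, P²·C)
assert patches.shape == (1, 196, 768)
```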
