ViT: Vision Transformer (2020)

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Nov 2, 2022

In 2020, the Google Brain team introduced the Vision Transformer (ViT), an image classification model that does not rely on a convolutional neural network (CNN). Instead, ViT splits an image into fixed-size patches and applies a Transformer encoder directly to the resulting sequence of patches.
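To make "sequences of image patches" concrete, here is a minimal NumPy sketch of the patch-splitting step only. The function name `image_to_patches`, the 224x224 RGB input size, and the channels-last layout are illustrative assumptions, not details taken from the text above; the 16x16 patch size follows the paper's title.

```python
import numpy as np


def image_to_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened square patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch_size * patch_size * C), ordered row by row.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    # (grid_h, patch, grid_w, patch, C) -> (grid_h, grid_w, patch, patch, C)
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each patch into one vector, giving one "word" per patch.
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)


# Assumed input: a 224x224 RGB image (a common size, chosen for illustration).
image = np.zeros((224, 224, 3), dtype=np.float32)
patches = image_to_patches(image)
print(patches.shape)  # (196, 768): 14 x 14 patches, each 16 * 16 * 3 values
```

In the full model, each flattened patch would then be linearly projected to an embedding before entering the Transformer encoder; that part is not shown in this sketch.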

Written by Naoki