ViT: Vision Transformer (2020)

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Naoki

--

In 2020, the Google Brain team developed Vision Transformer (ViT), an image classification model without a CNN (convolutional neural network). ViT directly applies a Transformer Encoder to sequences of image patches for classification.

--

--