ViT: Vision Transformer (2020)
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
In 2020, the Google Brain team introduced the Vision Transformer (ViT), an image classification model that uses no convolutional neural network (CNN). Instead, ViT splits an image into fixed-size patches and applies a Transformer encoder directly to the resulting sequence of patches.
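To make the idea of a "sequence of image patches" concrete, here is a minimal sketch of the patch-splitting step in plain Python. The function name `image_to_patches`, the nested-list image representation, and the 224 x 224 RGB input size are assumptions chosen for illustration, not the authors' implementation; only the 16 x 16 patch size comes from the paper title. In the full model, each flattened patch would then be projected to an embedding before entering the Transformer encoder, a step this sketch leaves out.

```python
def image_to_patches(image, patch_size=16):
    """Split an H x W x C image (nested lists) into a list of flattened patches.

    Patches are read in row-major order: left to right, then top to bottom.
    This is an illustrative helper, not part of any library.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size x C block into a 1-D vector.
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


# A dummy 224 x 224 RGB image of zeros (size assumed for illustration).
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
patches = image_to_patches(image, patch_size=16)
print(len(patches), len(patches[0]))  # 196 768
```

With these assumed sizes, the image becomes a sequence of 14 x 14 = 196 "words", each a vector of 16 x 16 x 3 = 768 values, which is what lets a sequence model process an image much as it would a sentence.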