Swin Transformer (2021)

Hierarchical Vision Transformer using Shifted Windows

Naoki
8 min read · Nov 4, 2022


In 2021, Microsoft announced a new Vision Transformer called Swin Transformer, which can act as a backbone for computer vision tasks like image classification, object detection, and semantic segmentation.

The name Swin stands for Shifted windows, the mechanism that gives the Transformer a hierarchical view of the image and the main topic of this article.

Swin Transformer Architecture

ViT is NOT Suitable for Dense Prediction

ViT (the original Vision Transformer from Google) uses fixed-scale image patches (e.g., 16 x 16) as inputs to the encoder to extract features. Although ViT works well for image classification, it is less suitable for dense prediction tasks such as object detection and semantic segmentation, which require finer visual detail than 16 x 16 patches can provide. Moreover, images in those tasks are typically larger than those in image classification datasets, and using smaller patches on larger images increases the number of patches that the self-attention layers must process.
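To make the patching step concrete, here is a minimal PyTorch sketch of cutting an image into fixed 16 x 16 patches. It is only an illustration, not the actual ViT implementation (which typically uses a strided convolution), and the helper name extract_patches is my own.

```python
import torch

def extract_patches(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split a batch of images into flattened, non-overlapping square patches.

    images: (B, C, H, W) with H and W divisible by patch_size.
    Returns: (B, num_patches, C * patch_size * patch_size), the token sequence
    a ViT-style encoder would project and feed to its attention layers.
    """
    b, c, h, w = images.shape
    p = patch_size
    patches = images.reshape(b, c, h // p, p, w // p, p)       # split H and W into a grid of patches
    patches = patches.permute(0, 2, 4, 1, 3, 5)                # (B, H//p, W//p, C, p, p)
    return patches.reshape(b, (h // p) * (w // p), c * p * p)  # flatten each patch into one token

# A 224 x 224 RGB image becomes (224 / 16)^2 = 196 tokens of 3 * 16 * 16 = 768 values each.
x = torch.randn(1, 3, 224, 224)
print(extract_patches(x).shape)  # torch.Size([1, 196, 768])
```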

For example, with 16 x 16 patches, a 32 x 32 image produces four patches, while a 64 x 64 image produces 16. Since self-attention compares every patch with every other patch, its cost grows quadratically with the number of patches, and it quickly becomes prohibitive at the resolutions used for detection and segmentation.
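As a quick back-of-the-envelope check (num_patches below is just an illustrative helper, not part of any library), the snippet counts the patches for a few image sizes and the resulting number of pairwise attention scores; the quadratic blow-up is the problem that Swin's window-based attention is designed to avoid.

```python
# Token count for a square image split into non-overlapping 16 x 16 patches,
# and the number of pairwise scores that global self-attention must compute.
def num_patches(image_size: int, patch_size: int = 16) -> int:
    return (image_size // patch_size) ** 2

for size in (32, 64, 224, 448, 896):
    n = num_patches(size)
    print(f"{size:>3} x {size:<3} image -> {n:>4} patches, "
          f"{n * n:>9,} pairwise attention scores")
```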
