Swin Transformer (2021)

Hierarchical Vision Transformer using Shifted Windows

Naoki
8 min read · Nov 4, 2022

In 2021, Microsoft announced a new Vision Transformer called Swin Transformer, which can act as a backbone for computer vision tasks like image classification, object detection, and semantic segmentation.

The name Swin stands for Shifted windows, the mechanism that gives the Transformer hierarchical vision and is the main topic of this article.

Swin Transformer Architecture

ViT is NOT Suitable for Dense Prediction

ViT (the original Vision Transformer from Google) uses fixed-scale image patches (i.e., 16 x 16) as inputs to the encoder to extract features. Although ViT works well for image classification, it is unsuitable for dense prediction tasks, such as object detection and semantic segmentation, which require finer visual detail than 16 x 16 patches can capture. Moreover, images in those tasks are typically larger than those in image classification datasets, and using smaller patches on larger images increases the number of patches that the self-attention layers must process.

For example, with 32 x 32 images, 16 x 16 patches produce four patches per image. With 64 x 64 images, 16 x 16 patches mean 16 patches per image, so the attention mechanism must deal with four times as many patches. With 128 x 128 images, the number of patches becomes 64.

Swin Transformer uses a patch size of 4 x 4, so the number of patches is far greater than in the examples above. ViT’s global attention approach is impractical for semantic segmentation on high-resolution images.
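
To make the scaling concrete, here is a minimal Python sketch (not from the article; `num_patches` is just an illustrative helper) that counts non-overlapping patches for a ViT-style 16 x 16 patch size and a Swin-style 4 x 4 patch size:

```python
# Minimal sketch: count non-overlapping square patches per image.
# num_patches is an illustrative helper, not part of either model's code.
def num_patches(image_size: int, patch_size: int) -> int:
    """Patches along each side, squared."""
    return (image_size // patch_size) ** 2

for size in (32, 64, 128, 224, 1024):
    vit = num_patches(size, 16)   # ViT-style 16 x 16 patches
    swin = num_patches(size, 4)   # Swin-style 4 x 4 patches
    print(f"{size:>4} x {size:<4}  ViT (16x16): {vit:>5}   Swin (4x4): {swin:>6}")
```

Running it reproduces the numbers above (4, 16, and 64 patches for 32, 64, and 128 pixel images with 16 x 16 patches) and shows that switching to 4 x 4 patches multiplies the patch count by a factor of 16.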

So, we face the dilemma:

  • We need finer details for pixel-level prediction, so we want to use smaller patches.
  • The computational complexity of the attention mechanism increases quadratically with the number of image patches (see the sketch after this list).
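
Because global self-attention relates every patch to every other patch, its cost grows with the square of the patch count. The sketch below is an assumption-based illustration, not the paper's complexity formula; it simply counts the N² patch pairs a global attention layer would have to score:

```python
# Rough illustration: global self-attention scores every pair of patches,
# so the work grows as N^2 in the number of patches N.
def attention_pairs(n_patches: int) -> int:
    """Pairwise patch interactions a global attention layer computes."""
    return n_patches ** 2

for size in (64, 128, 256, 1024):
    n = (size // 4) ** 2  # Swin-style 4 x 4 patches
    print(f"{size} x {size}: {n:>6} patches -> {attention_pairs(n):>13,} patch pairs")
```

Doubling the image resolution quadruples the patch count and multiplies the pair count by sixteen, which is exactly the scaling problem the shifted-window design sets out to address.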

Hierarchical Feature Maps by Swin Transformer
