A Guide to Fully Understanding Swin Transformer (ICCV'21 Best Paper)

Yusuke Uchida (@yu4u), Mobility Technologies Co., Ltd.
Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows
This deck is an expanded version of material prepared for a reading group at DeNA + MoT.

2 The official implementation is a useful reference, so read it alongside this deck
▪ Official implementation
  ▪ https://github.com/microsoft/Swin-Transformer/blob/main/models/swin_transformer.py
▪ timm version (essentially a port of the official one)
  ▪ https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/swin_transformer.py
▪ If you want to use it as a backbone, use this one
  ▪ https://github.com/SwinTransformer/Swin-Transformer-Object-Detection/blob/master/mmdet/models/backbones/swin_transformer.py

3 Starting with something trivial
▪ Far too many "equal contribution" authors

4 User testimonials (personal impressions only)

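5 Overview
▪ The Transformer has become the de facto backbone in NLP
▪ Can the Transformer serve as a general-purpose backbone in vision, the way CNNs do? → Swin Transformer!
▪ The paper proposes extensions that address the domain differences between NLP and vision
  ▪ The scale problem
    ▪ In NLP the word token is the smallest unit of processing, whereas in images multi-scale processing matters for some tasks (e.g. detection) → hierarchical feature maps generated by patch merging
  ▪ The resolution problem
    ▪ Some tasks require processing at a finer resolution than the patch level → shifted windows reduce the computational cost and make high-resolution feature maps feasible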

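6 Architecture
▪ It can output C2-C5 feature maps, so it is compatible with CNN backbones
▪ The channel count doubling from stage to stage is also the same as in CNNs
Figure labels: C2, C3, C4, C5
Annotation: in theory, at least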
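As a rough illustration of the C2-C5 pyramid on this slide, the sketch below computes the stride, spatial size, and channel count of each stage. The helper name `stage_shapes` is hypothetical, and the defaults (224 input, patch size 4, embedding dimension 96, four stages) are taken from the default parameters listed in the parameter slide of this deck.

```python
def stage_shapes(img_size=224, patch_size=4, embed_dim=96, num_stages=4):
    """Return (name, stride, height, width, channels) for each stage.

    Hypothetical helper for illustration; not part of any library.
    """
    shapes = []
    for i in range(num_stages):
        stride = patch_size * 2 ** i      # 4, 8, 16, 32
        side = img_size // stride         # 56, 28, 14, 7 for a 224 input
        channels = embed_dim * 2 ** i     # channel count doubles per stage
        shapes.append((f"C{i + 2}", stride, side, side, channels))
    return shapes


for name, stride, h, w, c in stage_shapes():
    print(f"{name}: stride={stride:2d}, feature map={h}x{w}, channels={c}")
```

With these defaults the pyramid runs from a 56x56 map with 96 channels at C2 down to a 7x7 map with 768 channels at C5, which is the same stride layout a CNN backbone exposes.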

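7 The timm version is hard to use as a backbone for anything other than classification
▪ timm: the features have already been average-pooled at this stage
▪ Swin-Transformer-Object-Detection: the features of each level are properly returned as a list of tensors of shape BCHW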

8 The timm version is hard to use as a backbone for anything other than classification
https://github.com/rwightman/pytorch-image-models/issues/614

9 Architecture
▪ Main building blocks
  ▪ Patch Partition & Linear Embedding
  ▪ Patch Merging
  ▪ Swin Transformer Block

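10 Patch Partition & Linear Embedding
▪ Patch Partition
  ▪ As in ViT, the image is split into fixed-size patches
  ▪ The default patch is 4x4 → for an RGB image this yields tokens of 4x4x3 = 48 dimensions
▪ Linear Embedding
  ▪ Projects each patch (token) to C dimensions
▪ In practice both steps are done by a single conv2d with kernel_size = stride = patch size
▪ By default this is followed by Layer Normalization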
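To make the "one conv2d does both steps" remark concrete, here is a minimal pure-Python sketch on a toy image. It checks that flattening each patch and applying a linear layer gives the same result as a convolution with kernel_size = stride = patch size that uses the same weights. All function names are hypothetical, the bias and the trailing Layer Normalization are left out, and integer weights are used so the comparison is exact.

```python
def patch_tokens(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([
                value
                for dr in range(patch)
                for dc in range(patch)
                for value in image[top + dr][left + dc]
            ])
    return tokens


def linear(tokens, weight):
    """Bias-free linear layer; weight has shape (out_dim, in_dim)."""
    return [[sum(wi * ti for wi, ti in zip(row, tok)) for row in weight]
            for tok in tokens]


def strided_conv(image, weight, patch):
    """Convolution with kernel_size = stride = patch, reading the same weights
    as a kernel of shape (out_dim, patch, patch, C)."""
    h, w, c = len(image), len(image[0]), len(image[0][0])
    out = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            vec = []
            for row in weight:
                acc = 0
                for dr in range(patch):
                    for dc in range(patch):
                        for ch in range(c):
                            k = row[(dr * patch + dc) * c + ch]
                            acc += k * image[top + dr][left + dc][ch]
                vec.append(acc)
            out.append(vec)
    return out


# Toy 8x8 RGB image and a 4x4 patch: 2x2 = 4 tokens of 4*4*3 = 48 dimensions.
H, W, C, P, D = 8, 8, 3, 4, 5
image = [[[(r * W + c) * C + ch for ch in range(C)] for c in range(W)]
         for r in range(H)]
weight = [[(d + 1) * ((i % 7) - 3) for i in range(P * P * C)] for d in range(D)]

tokens = patch_tokens(image, P)
embedded = linear(tokens, weight)
print(len(tokens), len(tokens[0]), len(embedded[0]))  # 4 48 5
print(embedded == strided_conv(image, weight, P))     # True
```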

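11 Patch Merging
▪ Merges each neighboring 2x2 group of C-dimensional patches
▪ concat → 4C dimensions
▪ Layer Normalization
▪ Linear → 2C dimensions
Annotation: because the tensors are kept in (B, HW, C) layout, pixel_unshuffle is perhaps awkward to use?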
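The concat step of Patch Merging can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: `patch_merge_concat` is a hypothetical name, the order in which the four neighbors are concatenated is a choice made for this sketch, and the Layer Normalization and the 4C → 2C Linear are omitted.

```python
def patch_merge_concat(feature):
    """Concatenate each 2x2 neighborhood of an H x W x C map into one 4C vector.

    Only the concat step is shown; LayerNorm and the 4C -> 2C Linear are omitted.
    """
    h, w = len(feature), len(feature[0])
    assert h % 2 == 0 and w % 2 == 0, "height and width must be even"
    merged = []
    for r in range(0, h, 2):
        row = []
        for c in range(0, w, 2):
            # Neighbor order is an assumption of this sketch:
            # top-left, bottom-left, top-right, bottom-right.
            row.append(
                feature[r][c] + feature[r + 1][c]
                + feature[r][c + 1] + feature[r + 1][c + 1]
            )
        merged.append(row)
    return merged


# Toy 4x4 map with C = 2: each position stores [row, col].
feature = [[[r, c] for c in range(4)] for r in range(4)]
merged = patch_merge_concat(feature)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 2 2 8
```

The spatial size halves along each axis while the channel count goes from C to 4C, and the Linear layer on the slide then brings it down to 2C.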

12 Swin Transformer Block
▪ Almost identical to a Transformer encoder layer
▪ The difference is Shifted Window-based Multi-head Self-attention
Figure: Two Successive Swin Transformer Blocks (annotation: this is the key point)

Swin Transformer Block
Figure: Two Successive Swin Transformer Blocks (annotation: this is the key point)
Figure labels: Pre-norm, Post-norm

14 Post-norm vs. Pre-norm
▪ Learning Deep Transformer Models for Machine Translation, ACL'19.
▪ On Layer Normalization in the Transformer Architecture, ICML'20.
Annotation: reminiscent of post-activation vs. pre-activation in ResNet, isn't it?
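The difference between the two placements comes down to where the normalization sits relative to the residual connection. The sketch below shows the usual textbook forms, pre-norm as x + f(LN(x)) and post-norm as LN(x + f(x)). It is a toy illustration: the function names are hypothetical, the layer norm has no learnable gain or bias, and `sublayer` stands in for attention or the MLP.

```python
import math


def layer_norm(x, eps=1e-5):
    """Parameter-free layer normalization of a single vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_block(x, sublayer):
    """Pre-norm: normalize the sublayer input; the residual path is untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def post_norm_block(x, sublayer):
    """Post-norm: normalize after the residual addition."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def zero_sublayer(v):
    """Toy sublayer that outputs zeros, to expose what happens to the skip path."""
    return [0.0] * len(v)


x = [1.0, 2.0, 3.0, 4.0]
print(pre_norm_block(x, zero_sublayer))   # identical to x: identity skip path
print(post_norm_block(x, zero_sublayer))  # normalized, so no longer equal to x
```

With a sublayer that contributes nothing, the pre-norm block passes the input through unchanged, while the post-norm block still rescales it.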

Swin Transformer Block
Figure: Two Successive Swin Transformer Blocks (annotation: this is the key point)

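16 Window-based Multi-head Self-attention (W-MSA)
▪ The feature map is partitioned into windows of size MxM, and self-attention is computed only within each window
▪ For a feature map with h x w patches, the cost drops from (hw) x (hw) (the square of the number of patches) to M^2 x M^2 (per window) x (h/M) x (w/M) (the number of windows) = M^2 hw
▪ M = 7 (for an input size of 224)
▪ At C2 (stride = 4, 56x56 feature map) this gives 8x8 windows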
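The reduction from (hw) x (hw) to M^2 hw is easy to check numerically. The sketch below counts query-key pairs for global attention and for W-MSA at the C2 resolution quoted on the slide (h = w = 56, M = 7). The function name `attention_pairs` is hypothetical, and the count ignores the channel dimension and the projection layers.

```python
def attention_pairs(h, w, window):
    """Count query-key pairs for global MSA and for window-based MSA (W-MSA)."""
    assert h % window == 0 and w % window == 0, "map must tile into windows"
    num_windows = (h // window) * (w // window)
    per_window = (window * window) ** 2       # M^2 x M^2 pairs inside one window
    global_pairs = (h * w) ** 2               # (hw) x (hw)
    window_pairs = per_window * num_windows   # equals M^2 * h * w
    return global_pairs, window_pairs, num_windows


global_pairs, window_pairs, num_windows = attention_pairs(56, 56, 7)
print(global_pairs, window_pairs, num_windows, global_pairs // window_pairs)
# 9834496 153664 64 64
```

At this resolution the saving factor equals the number of windows (64), because the global cost is (hw)^2 and the windowed cost is M^2 hw, so the ratio is hw / M^2.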

17 Shifted Window-based Multi-head Self-attention (SW-MSA)
▪ W-MSA with the windows shifted by (M/2, M/2)
▪ Alternating it with the regular window-based attention creates connections between adjacent windows
Figure: example with h = w = 8, M = 4

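18 Efficient implementation of SW-MSA
▪ In the example below a naive shift produces 9 windows, but instead the feature map itself is shifted and attention is computed over the same 2x2 windows as in the unshifted case
▪ Windows that actually contain patches from several different windows use a mask to set the attention between those windows to 0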

19 Implementation
Code annotations: shift, reverse shift, the (S)W-MSA body itself
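The shift-and-mask trick can be sketched without any tensor library. The example below uses the h = w = 8, M = 4 case from the earlier slide: it rolls the map by M/2 with wrap-around, labels every position with the region it came from, and builds one additive mask per window that blocks attention across regions. All names are hypothetical, the wrap-around roll is how this sketch realizes the shift, and -100 is just a convenient large negative value to add before the softmax.

```python
def cyclic_shift(grid, shift):
    """Roll a 2-D grid up and to the left by `shift` positions, wrapping around."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + shift) % h][(c + shift) % w] for c in range(w)]
            for r in range(h)]


def region_labels(h, w, window, shift):
    """Label positions of the shifted map so that two positions inside the same
    window share a label only if they came from the same shifted window."""
    def band(i, size):
        if i < size - window:
            return 0
        if i < size - shift:
            return 1
        return 2
    return [[3 * band(r, h) + band(c, w) for c in range(w)] for r in range(h)]


def window_partition(grid, window):
    """Split a 2-D grid into non-overlapping window x window blocks, flattened."""
    h, w = len(grid), len(grid[0])
    return [
        [grid[top + dr][left + dc] for dr in range(window) for dc in range(window)]
        for top in range(0, h, window)
        for left in range(0, w, window)
    ]


def attention_masks(h, w, window, shift, neg=-100.0):
    """One M^2 x M^2 additive mask per window: 0 inside a region, neg across."""
    labels = region_labels(h, w, window, shift)
    return [
        [[0.0 if a == b else neg for b in tokens] for a in tokens]
        for tokens in window_partition(labels, window)
    ]


H, W, M = 8, 8, 4
positions = [[(r, c) for c in range(W)] for r in range(H)]
shifted = cyclic_shift(positions, M // 2)
labels = region_labels(H, W, M, M // 2)
masks = attention_masks(H, W, M, M // 2)
print(shifted[0][0], shifted[7][7])              # (2, 2) (1, 1)
print(len({v for row in labels for v in row}))   # 9 regions
print(len(masks))                                # 4 windows (2x2)
```

After attention, the reverse shift is the same roll with the opposite sign, which puts every token back at its original position.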

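20 Relative Position Bias
▪ Self-attention by itself is merely an encoder of a set
▪ Positional encoding is what tells it that the data is a sequence
▪ Swin uses a Relative Position Bias
▪ Making the bias relative expresses translation invariance
Annotation: the attention strength is adjusted according to the relative positions within a window (learnable)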

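21 ▪ Relative positions range over [-M + 1, M - 1] both vertically and horizontally, giving (2M-1)^2 patterns
▪ The implementation stores the mapping between biases and indices, and looks the biases up when they are needed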
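The index lookup described on this slide can be sketched as follows: every pair of positions inside an MxM window is mapped to one of (2M-1)^2 slots of a bias table. The function name `relative_position_index` and the zero-initialized table are assumptions for illustration; in a real model the table would be a learnable parameter, typically one per attention head.

```python
def relative_position_index(window):
    """Map every (query, key) pair in an M x M window to a slot in a
    (2M-1)^2-entry bias table, based only on their relative offset."""
    coords = [(r, c) for r in range(window) for c in range(window)]
    span = 2 * window - 1
    index = []
    for ri, ci in coords:
        row = []
        for rj, cj in coords:
            dr = ri - rj + window - 1   # shift [-M+1, M-1] to [0, 2M-2]
            dc = ci - cj + window - 1
            row.append(dr * span + dc)
        index.append(row)
    return index


M = 7
index = relative_position_index(M)
bias_table = [0.0] * (2 * M - 1) ** 2   # placeholder for learnable values
bias = [[bias_table[i] for i in row] for row in index]   # M^2 x M^2 bias matrix

print(len(index), len(index[0]))                 # 49 49
print(len({i for row in index for i in row}))    # 169
```

Pairs with the same offset share a table entry, which is what makes the bias depend on relative rather than absolute position.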

22 Positional Encoding (an aside)
▪ On Position Embeddings in BERT, ICLR'21
  ▪ https://openreview.net/forum?id=onxoVA9FxMw
  ▪ https://twitter.com/akivajp/status/1442241252204814336
▪ Rethinking and Improving Relative Position Encoding for Vision Transformer, ICCV'21. Thanks to @sasaki_ts
▪ CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, arXiv'21. Thanks to @Ocha_Cocoa

23 Parameters and such
img_size (int | tuple(int)): Input image size. Default: 224
patch_size (int | tuple(int)): Patch size. Default: 4
in_chans (int): Number of input image channels. Default: 3
num_classes (int): Number of classes for classification head. Default: 1000
embed_dim (int): Patch embedding dimension. Default: 96
depths (tuple(int)): Depth of each Swin Transformer layer. [2, 2, 6, 2]
num_heads (tuple(int)): Number of attention heads in different layers. [3, 6, 12, 24]
window_size (int): Window size. Default: 7
mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4
qkv_bias (bool): If True, add a learnable bias to query, key, value. Default: True
qk_scale (float): Override default qk scale of head_dim ** -0.5 if set. Default: None
drop_rate (float): Dropout rate. Default: 0
attn_drop_rate (float): Attention dropout rate. Default: 0
drop_path_rate (float): Stochastic depth rate. Default: 0.1
norm_layer (nn.Module): Normalization layer. Default: nn.LayerNorm.
ape (bool): If True, add absolute position embedding to the patch embedding. Default: False
patch_norm (bool): If True, add normalization after patch embedding. Default: True
use_checkpoint (bool): Whether to use checkpointing to save memory. Default: False
▪ Stochastic depth is used heavily (see drop_path_rate)
▪ The number of heads grows along with the channel dimension (see num_heads)

24 Model Configuration
▪ Stochastic depth drop probability when training for classification: T: 0.2, S: 0.3, B: 0.5
▪ For detection and segmentation it is 0.2 for all models

25 Stochastic Depth
▪ Applied to both the MSA and the MLP (FF) branches
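A minimal sketch of what "applied to both branches" means: each residual branch is independently either dropped entirely or kept and rescaled during training, and left untouched at inference. The names `drop_path` and `block` are hypothetical, the sketch handles a single sample as a plain list, normalization is left out, and drop_prob is assumed to be strictly less than 1.

```python
import random


def drop_path(branch, drop_prob, training, rng):
    """Stochastic depth for one sample: drop the whole residual branch with
    probability drop_prob, otherwise rescale it by 1 / (1 - drop_prob)."""
    if not training or drop_prob == 0.0:
        return list(branch)
    if rng.random() < drop_prob:
        return [0.0] * len(branch)
    keep = 1.0 - drop_prob
    return [v / keep for v in branch]


def block(x, msa, mlp, drop_prob, training, rng):
    """Residual block in which both the attention branch and the MLP branch
    go through drop_path. Normalization is omitted to keep the sketch short."""
    x = [a + b for a, b in zip(x, drop_path(msa(x), drop_prob, training, rng))]
    x = [a + b for a, b in zip(x, drop_path(mlp(x), drop_prob, training, rng))]
    return x


def toy_msa(v):
    """Stand-in for the attention branch: contributes 1.0 per element."""
    return [1.0] * len(v)


def toy_mlp(v):
    """Stand-in for the MLP branch: contributes 10.0 per element."""
    return [10.0] * len(v)


rng = random.Random(0)
print(block([0.0], toy_msa, toy_mlp, 0.5, training=False, rng=rng))  # [11.0]
print(block([0.0], toy_msa, toy_mlp, 0.5, training=True, rng=rng))
```

In training mode with drop_prob = 0.5 the toy output is one of 0, 2, 20, or 22, depending on which of the two branches survived; in expectation it matches the inference value of 11.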

26 Experimental results
▪ SOTA! Amazing!

27 Ablation Study
▪ The shifted window and the relative position bias both matter

28 Sliding window vs. shifted window
▪ The shifted window reaches comparable accuracy and is faster

29 Related work: CSWin Transformer
▪ Splits the channels into two halves and applies self-attention within horizontal and vertical stripes
X. Dong, et al., "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows," in arXiv:2107.00652.

30 🤔 Related work: Pyramid Vision Transformer
W. Wang, et al., "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions," in Proc. of ICCV, 2021.
https://github.com/whai362/PVT

31 Related work: Pyramid Vision Transformer
W. Wang, et al., "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions," in Proc. of ICCV, 2021.

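Related work: Pyramid Vision Transformer
W. Wang, et al., "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions," in Proc. of ICCV, 2021.
▪ Multiple patches are merged, then flatten, linear, norm
▪ This is the same as Patch Merging except that the order of linear and norm is reversed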

Related work: Pyramid Vision Transformer
W. Wang, et al., "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions," in Proc. of ICCV, 2021.
▪ The position embedding is the ordinary learned kind

Related work: Pyramid Vision Transformer
W. Wang, et al., "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions," in Proc. of ICCV, 2021.
▪ Spatial-Reduction Attention (SRA) is the key point

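35 Spatial-Reduction Attention (SRA)
▪ Only K and V (the dictionary side) are reduced in spatial size
▪ Implemented as Conv2D -> LayerNorm
▪ Q is left as is, so the output size does not change
▪ The reduction ratios are 8, 4, 2, 1, matched to the stride of each stage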
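To see what reducing only K and V buys, the sketch below computes, per stage, how many query tokens and how many key/value tokens take part in attention. The function name `sra_shapes` is hypothetical, and the 224 input size and the per-stage strides 4, 8, 16, 32 are assumptions of this example; only the reduction ratios 8, 4, 2, 1 come from the slide.

```python
def sra_shapes(img_size=224, strides=(4, 8, 16, 32), reductions=(8, 4, 2, 1)):
    """Per stage: (num_queries, num_keys) when only K and V are spatially reduced.

    The output has one token per query, so the output size is unchanged.
    """
    shapes = []
    for stride, reduction in zip(strides, reductions):
        side = img_size // stride          # feature map side at this stage
        kv_side = side // reduction        # side of the reduced K/V map
        shapes.append((side * side, kv_side * kv_side))
    return shapes


for stage, (num_q, num_kv) in enumerate(sra_shapes(), start=1):
    print(f"stage {stage}: queries={num_q}, keys/values={num_kv}, "
          f"attention matrix={num_q}x{num_kv}")
```

Under these assumptions K and V shrink to a 7x7 map (49 tokens) at every stage, so the attention matrix grows only linearly with the number of query tokens.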

36 Related work: Pyramid Vision Transformer
▪ There is a V2 as well!
▪ It should say 2021, not 2020, so could someone please send them a PR
https://github.com/whai362/PVT

37 Related work: Swin Transformer V2
▪ They somehow managed to cram a huge model onto GPUs!
▪ It has switched to post-norm...
Ze Liu, et al., "Swin Transformer V2: Scaling Up Capacity and Resolution," in arXiv:2111.09883.

38 Related work: MetaFormer
▪ The general structure of the Transformer itself matters more than the token mixer
▪ Token mixer = self-attention, MLP
▪ Proposes PoolFormer, whose token mixer is plain pooling
W. Yu, et al., "MetaFormer is Actually What You Need for Vision," in arXiv:2111.11418.
Figure labels: Conv3x3 stride=2, Ave pool3x3
