Recent Hierarchical Vision Transformers

Recent Vision Transformers: Aren't They All the Same?
Yusuke Uchida (@yu4u), Mobility Technologies Co., Ltd.
This deck is an expanded version of material prepared for a reading group at DeNA + MoT.

2 Background
▪ The rise of ViT [1]
  ▪ Transformers for images too! But it needs a huge dataset (JFT-300M)
▪ DeiT [2]
  ▪ Established a training recipe for ViT; CNN-level accuracy with ImageNet alone
▪ MLP-Mixer [3]
  ▪ An MLP works in place of attention!
▪ A flood of ViT improvements and attention substitutes (MLP, pool, shift, LSTM)
[1] A. Dosovitskiy, et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," in Proc. of ICLR'21.
[2] H. Touvron, et al., "Training Data-efficient Image Transformers & Distillation Through Attention," in Proc. of ICLR'21.
[3] I. Tolstikhin, et al., "MLP-Mixer: An all-MLP Architecture for Vision," in Proc. of NeurIPS'21.

3 Early improvement: Pooling-based Vision Transformer (PiT)
▪ Pooling with a stride-2 depthwise conv
▪ Loses to ResNet50 on Deformable DETR-based object detection
B. Heo, et al., "Rethinking Spatial Dimensions of Vision Transformers," in Proc. of ICCV'21.

4 Early improvement: CoAtNet
▪ Stages 1 and 2 use MBConv (the main building block of MobileNetV2 onward and EfficientNet); stages 3 and 4 use attention (+ relative positional embedding)
▪ MBConv downsamples with a strided conv; attention downsamples with pooling
Z. Dai, et al., "CoAtNet: Marrying Convolution and Attention for All Data Sizes," in Proc. of NeurIPS'21.

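5 Background
▪ Naturally, we want to apply ViT to object detection and segmentation as well
▪ CNNs produce feature maps at multiple resolutions, from 1/4 to 1/32 of the input image
▪ ViT produces only a 1/16 map, which is ill-suited to detecting small objects and to fine-grained segmentation
▪ We want to handle high-resolution feature maps too!
▪ Everyone designed a Vision Transformer that clears this hurdle, and the result is...
▪ (Next, speed-ups and the like become the trend)
B. Heo, et al., "Rethinking Spatial Dimensions of Vision Transformers," in Proc. of ICCV'21.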

6 Recent Vision Transformers (aren't they all the same!?)
Swin Transformer, PoolFormer, ShiftViT, AS-MLP, Shunted Transformer, CSWin Transformer, ResT, SepViT, Lite Vision Transformer, Pyramid Vision Transformer

7 Recent Vision Transformers (aren't they all the same!?)
By the end of this deck, you will be able to say "No, they are different!"

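8 Scope of this deck
▪ Introduces hierarchical Vision Transformer backbones applicable to object detection and semantic segmentation
▪ In this deck, "Vision Transformer" means not ViT specifically but, in general, models that apply transformers to vision tasks
▪ Methods such as DETR that use attention in the task-solving part are not covered
▪ Assumes a rough understanding of the attention layer
  ▪ Linearly project the input to make Q, K, V
  ▪ Compute weights from $\mathrm{softmax}(QK^\top)$ and output the weighted sum of V
  ▪ Several of these run in parallel (multi-head); that much is enough!
  ▪ Go read "図で理解するTransformer" (Understanding the Transformer through Figures)!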

9 General form of hierarchical Vision Transformers (CNN-like hierarchical structure)
▪ Nearly all the Vision Transformers introduced here can be expressed in this form
  ▪ The main difference is the token mixer in the transformer block
▪ Models that do not use attention, such as MLP-Mixer, PoolFormer, and ShiftViT, can also be seen as ViTs that differ only in the token mixer
▪ [1] calls this structure MetaFormer and argues that the structure itself is what contributes to performance
[Figure: Input (3×H×W) → Patch Embedding → Transformer Blocks (Stage 1, C1 × H/4 × W/4) → Patch Merging → Transformer Blocks (Stage 2, C2 × H/8 × W/8) → Patch Merging → Transformer Blocks (Stage 3, C3 × H/16 × W/16) → Patch Merging → Transformer Blocks (Stage 4, C4 × H/32 × W/32). Transformer Block: Norm → Token Mixer → (+) → Norm → FFN → (+)]
[1] W. Yu, et al., "MetaFormer is Actually What You Need for Vision," in Proc. of CVPR'22.
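
As a rough illustration of "same block, different token mixer", the sketch below writes the block from the figure as a function that takes the token mixer as an argument. It is a minimal NumPy sketch, not any paper's implementation: `metaformer_block`, `mean_mixer`, the layer norm without learnable parameters, and ReLU standing in for GELU are all simplifying assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Layer norm over the channel axis, without learnable scale/shift (simplification).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def metaformer_block(x, token_mixer, w1, w2):
    """x: (N, C) tokens. token_mixer is any function mapping (N, C) -> (N, C)."""
    x = x + token_mixer(layer_norm(x))          # Norm -> Token Mixer -> residual
    hidden = np.maximum(layer_norm(x) @ w1, 0)  # Norm -> FFN (ReLU instead of GELU)
    return x + hidden @ w2                      # residual

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
w1 = rng.normal(size=(8, 32)) * 0.1
w2 = rng.normal(size=(32, 8)) * 0.1

# A toy mixer (hypothetical): every token receives the mean of all tokens.
mean_mixer = lambda t: np.broadcast_to(t.mean(0, keepdims=True), t.shape)
y = metaformer_block(x, mean_mixer, w1, w2)
print(y.shape)  # (16, 8)
```

Swapping `mean_mixer` for attention, pooling, or a shift changes the model family without touching the rest of the block.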

10 Components of hierarchical Vision Transformers
▪ Patch embedding: split the image into patches and turn them into tokens
▪ Positional encoding: add positional information to the tokens
▪ Patch merging: halve the spatial resolution and increase the number of channels
▪ Transformer block: feature extraction with a token mixer (attention) and an FFN

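11 Patch Embedding
▪ Split the image into small patches and convert them into high-dimensional tokens
  ▪ 16×16 in ViT
  ▪ 4×4 is typical in hierarchical Vision Transformers
▪ Implemented as
  ▪ rearrange (einops) -> linear, or
  ▪ Conv2D (kernel size = stride = patch size)
▪ Some models split into overlapping patches
▪ Some models downsample with multiple Conv2D layers, like a CNN
▪ Layer norm may or may not be present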
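
The "rearrange -> linear" route can be sketched in plain NumPy as below; a Conv2D with kernel size = stride = patch size computes the same thing. The function name `patch_embed`, the embedding dimension 96, and the random weights are illustrative assumptions, and the image size is assumed to be divisible by the patch size.

```python
import numpy as np

def patch_embed(image, weight, patch=4):
    """Split a (C, H, W) image into non-overlapping patches and project each to a token.

    weight has shape (C * patch * patch, D). Returns (H/patch * W/patch, D).
    """
    c, h, w = image.shape
    gh, gw = h // patch, w // patch
    # "rearrange" step: (C, gh, p, gw, p) -> (gh, gw, C, p, p) -> (gh*gw, C*p*p)
    x = image.reshape(c, gh, patch, gw, patch).transpose(1, 3, 0, 2, 4)
    x = x.reshape(gh * gw, c * patch * patch)
    return x @ weight  # "linear" step

rng = np.random.default_rng(0)
image = rng.normal(size=(3, 224, 224))
weight = rng.normal(size=(3 * 4 * 4, 96))  # D = 96 is an illustrative choice
tokens = patch_embed(image, weight)
print(tokens.shape)  # (3136, 96), i.e. a 56x56 token map
```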

12 Positional Encoding
▪ A transformer (attention) is by itself an encoder (decoder) of sets
  ▪ Positional encoding must add positional information to each token
▪ There are many approaches
  ▪ Relative or absolute × fixed (sinusoidal) or learnable [1]
  ▪ Conditional positional encodings [2] (interesting, so covered in the appendix of this deck)
  ▪ Implicit embedding through the conv in the FFN [3]
▪ Absolute positional encoding is added to the input tokens
  ▪ The original ViT does this
▪ Relative positional encoding is added to the dot-product part of attention
[1] K. Wu, et al., "Rethinking and Improving Relative Position Encoding for Vision Transformer," in Proc. of ICCV'21.
[2] X. Chu, et al., "Conditional Positional Encodings for Vision Transformers," in arXiv:2102.10882.
[3] E. Xie, et al., "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers," in Proc. of NeurIPS'21.

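13 Patch Merging
▪ Merge each 2×2 neighborhood of tokens, halving the spatial resolution while increasing the token dimension (usually by 2×)
▪ Often implemented the same way as patch embedding
▪ As with patch embedding, some models make the patches overlap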
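
A minimal sketch of the 2×2 merge, assuming the common "concatenate four neighbors, then project 4C -> 2C" variant. `patch_merge` and the random weights are illustrative, and any normalization layer is omitted.

```python
import numpy as np

def patch_merge(x, weight):
    """Merge each 2x2 neighborhood of tokens.

    x: (H, W, C) token map, weight: (4*C, C_out). Returns (H/2, W/2, C_out).
    """
    h, w, c = x.shape
    x = x.reshape(h // 2, 2, w // 2, 2, c).transpose(0, 2, 1, 3, 4)
    x = x.reshape(h // 2, w // 2, 4 * c)  # concatenate the 4 neighbors channel-wise
    return x @ weight                     # project to the new channel count

rng = np.random.default_rng(0)
x = rng.normal(size=(56, 56, 96))
weight = rng.normal(size=(4 * 96, 192))
merged = patch_merge(x, weight)
print(merged.shape)  # (28, 28, 192)
```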

14 Transformer block
▪ Consists of layer norm, a token mixer (= self-attention), a feed-forward network (FFN) (= MLP), and skip connections
▪ The self-attention part is the key
  ▪ $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^\top / \sqrt{d}\right) V$
  ▪ $Q = W_q X,\ K = W_k X,\ V = W_v X$
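
The attention formula can be written out as a single-head NumPy sketch. Tokens are stored as rows here, so the projections read `x @ W` instead of `W X`; the sizes and random weights are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (N, C) tokens as rows. Returns the output (N, d) and the weights (N, N)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))  # every token attends to every token
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(196, 64))  # e.g. a 14x14 token map
w_q, w_k, w_v = (rng.normal(size=(64, 64)) * 0.1 for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (196, 64) (196, 196)
```

The `(N, N)` weight matrix is the part whose size grows with the square of the number of tokens.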

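15 The problem when trying to use high-resolution feature maps
▪ The cost of self-attention is proportional to the square of the sequence length (the $QK^\top$ dot products)
  ▪ For images, sequence length = image size (H×W of the feature map)
  ▪ For ViT with a 224 input, the feature map is 14x14 (1/16 of the input)
  ▪ Enlarging the image (e.g. 1280) and raising the resolution (e.g. 1/4 of the input) makes the cost blow up
▪ How to solve this problem is what differs between methods
  ▪ Window (local) attention, which restricts the attention range locally
  ▪ Spatial-reduction attention, which shrinks the spatial size of K and V (Q is left as is)
  ▪ In fact, almost everything is one of these two patterns (spoiler)
  ▪ Some combine the two, some do spatial reduction at multiple scales, some build the windows differently...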
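
To put rough numbers on the blow-up, the sketch below counts tokens and attention-matrix entries (per head, per layer) for the two settings mentioned in the slide. The helper name is made up for illustration.

```python
def attention_matrix_entries(image_size, stride):
    """Return (number of tokens, number of entries in the N x N attention matrix)."""
    side = image_size // stride
    n = side * side
    return n, n * n

print(attention_matrix_entries(224, 16))  # (196, 38416)
print(attention_matrix_entries(1280, 4))  # (102400, 10485760000)
```

Going from a 14x14 map to a 320x320 map multiplies the token count by about 520 and the attention matrix by about 270,000.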

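16 Summary so far
▪ Hierarchical Vision Transformers have been proposed so that Vision Transformers can serve as backbones for object detection and segmentation
▪ They share a common structure built from the following modules
  ▪ Patch embedding: split the image into patches and turn them into tokens
  ▪ Positional encoding: add positional information to the tokens
  ▪ Patch merging: halve the spatial resolution and increase the number of channels
  ▪ Transformer block: feature extraction with a token mixer (attention) and an FFN
▪ The key point is reducing the cost of the attention part of the transformer block
  ▪ Approaches divide broadly into window (local) attention and spatial-reduction attention
▪ From here on: a rough walk-through focused on each model's token mixer (attention)!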

17 Swin Transformer
▪ Token mixer: Shifted Window-based Multi-head Self-attention
[Figure: Two Successive Swin Transformer Blocks, with the key part highlighted]
Z. Liu, et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows," in Proc. of ICCV'21.
See also the deck "Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料"!

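18 Window-based Multi-head Self-attention (W-MSA)
▪ Partition the feature map into windows of size M×M and compute self-attention only within each window
▪ For a feature map of h×w patches, the cost drops from $(hw) \times (hw)$ (the square of the number of patches) to $(M^2 \times M^2) \times (h/M)(w/M) = M^2 hw$ (per-window cost × number of windows)
▪ M=7 (for input size 224)
  ▪ At C2 (stride=4, 56x56 feature map) this gives 8x8 windows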
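
The window partition and the cost comparison can be checked with a few lines of NumPy, using the C2 numbers from the slide (56x56 map, M=7). `window_partition` is an illustrative helper and the channel count 96 is an assumption.

```python
import numpy as np

def window_partition(x, m):
    """x: (H, W, C) -> (num_windows, m*m, C) with non-overlapping m x m windows."""
    h, w, c = x.shape
    x = x.reshape(h // m, m, w // m, m, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, m * m, c)

h = w = 56
m = 7
windows = window_partition(np.zeros((h, w, 96)), m)
print(windows.shape)  # (64, 49, 96): 8x8 windows of 49 tokens each

global_cost = (h * w) ** 2                        # (hw) x (hw)
window_cost = (m * m) ** 2 * (h // m) * (w // m)  # per-window cost x number of windows
print(global_cost, window_cost, m * m * h * w)    # 9834496 153664 153664
```

For this setting, windowing shrinks the attention matrix by a factor of 64, which equals the number of windows.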

19 Shifted Window-based Multi-head Self-attention (SW-MSA)
▪ W-MSA with the windows shifted by (M/2, M/2)
▪ Alternating it with regular window-based attention creates connections between adjacent windows
[Figure: example with h=w=8, M=4]

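20 Efficient implementation of SW-MSA
▪ The case below yields 9 windows, but the feature map is shifted so that attention is computed over the same 2x2 windows as in the unshifted case
▪ In practice, a window that mixes several original windows uses an attention mask to set the attention between them to 0 (the same kind of mask a decoder normally uses to avoid looking at future tokens)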
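
The shift-then-mask trick can be sketched for the h=w=8, M=4 example as follows. It is a NumPy sketch under assumptions: the function names are illustrative, the mask is boolean (True = attention allowed), and real implementations typically add a large negative value to the disallowed logits before the softmax instead.

```python
import numpy as np

def window_partition(x, m):
    """x: (H, W) -> (num_windows, m*m)."""
    h, w = x.shape
    return x.reshape(h // m, m, w // m, m).transpose(0, 2, 1, 3).reshape(-1, m * m)

def shifted_window_mask(h, w, m):
    """Return a (num_windows, m*m, m*m) boolean mask; True = attention allowed."""
    shift = m // 2
    region = np.zeros((h, w), dtype=int)
    slices = (slice(0, -m), slice(-m, -shift), slice(-shift, None))
    label = 0
    for rows in slices:          # label the 9 regions of the shifted map
        for cols in slices:
            region[rows, cols] = label
            label += 1
    ids = window_partition(region, m)  # region id of each token, per window
    return ids[:, :, None] == ids[:, None, :]

h = w = 8
m = 4
shift = m // 2
feature = np.arange(h * w).reshape(h, w)
shifted = np.roll(feature, shift=(-shift, -shift), axis=(0, 1))  # cyclic shift
mask = shifted_window_mask(h, w, m)
print(window_partition(shifted, m).shape, mask.shape)  # (4, 16) (4, 16, 16)
print(mask.all(axis=(1, 2)).tolist())  # [True, False, False, False]
```

Only the first window contains tokens from a single region; the other three mix regions and therefore need the mask.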

21 CSWin Transformer
▪ Split the channels into two halves and apply self-attention within vertical stripes and horizontal stripes, respectively
X. Dong, et al., "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows," in Proc. of CVPR'22.

22 Swin Transformer V2
▪ Somehow squeezed a huge model onto GPUs!
▪ It has become post-norm...
Z. Liu, et al., "Swin Transformer V2: Scaling Up Capacity and Resolution," in Proc. of CVPR'22.

23 Pyramid Vision Transformer (PVT)
▪ Token mixer: Spatial-Reduction Attention (SRA)
[Figure: PVT architecture; Spatial-Reduction Attention (SRA) is the key point]
W. Wang, et al., "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions," in Proc. of ICCV'21.

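24 Spatial-Reduction Attention (SRA)
▪ Only K and V (the dictionary side) are spatially reduced
  ▪ Implemented as Conv2D -> LayerNorm
▪ Q is left as is, so the output size does not change
▪ The reduction ratios for the stages are 8, 4, 2, 1, matching the downsampling ratios of the feature maps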
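
A single-head sketch of SRA on a stage-1-sized map (56x56, reduction ratio 8). It assumes the Conv2D has kernel size = stride = r, which lets it be written as reshape + linear; the layer norm has no learnable parameters, and all names and weights are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def spatial_reduction_attention(x, w_q, w_k, w_v, w_sr, r):
    """x: (H, W, C). K and V come from an r-times smaller map; Q keeps full resolution."""
    h, w, c = x.shape
    q = x.reshape(h * w, c) @ w_q
    # Conv2D with kernel = stride = r, written as "gather r x r patch, then linear".
    xr = x.reshape(h // r, r, w // r, r, c).transpose(0, 2, 1, 3, 4)
    xr = layer_norm(xr.reshape(-1, r * r * c) @ w_sr)  # (H*W / r^2, C)
    k, v = xr @ w_k, xr @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))     # (H*W, H*W / r^2)
    return (attn @ v).reshape(h, w, -1), attn

rng = np.random.default_rng(0)
c, r = 32, 8
x = rng.normal(size=(56, 56, c))
w_q, w_k, w_v = (rng.normal(size=(c, c)) * 0.1 for _ in range(3))
w_sr = rng.normal(size=(r * r * c, c)) * 0.02
out, attn = spatial_reduction_attention(x, w_q, w_k, w_v, w_sr, r)
print(out.shape, attn.shape)  # (56, 56, 32) (3136, 49)
```

The output keeps the 56x56 resolution because Q is untouched, while the attention matrix shrinks from 3136x3136 to 3136x49.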

25 PVTv2
▪ The SRA downsampling is changed to average pooling
▪ Patch embedding uses a conv so that patches overlap
▪ A depthwise conv is inserted into the FFN and the positional embedding is removed (implicit positional encoding)
W. Wang, et al., "PVTv2: Improved Baselines with Pyramid Vision Transformer," in Journal of Computational Visual Media, 2022.

26 Multiscale Vision Transformers (MViT)
▪ A model whose main task is video recognition
▪ Attention with pooled K and V, as in PVT
  ▪ Compares max pool, average pool, and strided depthwise conv as the pooling function; depthwise conv gives the best accuracy
  ▪ PVT→PVTv2 changed conv→average pool
  ▪ PVT used a regular conv, not a depthwise one
▪ Interestingly, patch merging (downsampling) is done by downsampling Q
H. Fan, et al., "Multiscale Vision Transformers," in Proc. of ICCV'21.

27 MViTv2
▪ Adds a residual pooling connection
▪ Adds a decomposed relative position embedding E(rel)
  ▪ Instead of holding an H×W×T table, it assumes independence and holds one table per dimension
Y. Li, et al., "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection," in Proc. of CVPR'22.

28 CvT
▪ Uses a CNN-like stem conv for patch embedding
▪ Uses convolutional token embedding for positional encoding
▪ Uses depthwise separable conv to produce Q, K, V (K and V are reduced)
H. Wu, et al., "CvT: Introducing Convolutions to Vision Transformers," in Proc. of ICCV'21.

29 ResT
▪ Efficient multi-head self-attention
  ▪ Reduces K and V, same as PVT
  ▪ The difference is that the reduction uses a DWConv
Q. Zhang and Y. Yang, "ResT: An Efficient Transformer for Visual Recognition," in Proc. of NeurIPS'21.

30 Shunted Transformer
▪ This is also spatial-reduction attention
▪ Uses K and V with a different reduction ratio for each head
▪ The figure on the right is nice and easy to follow
S. Ren, et al., "Shunted Self-Attention via Multi-Scale Token Aggregation," in Proc. of CVPR'22.

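31 Inception Transformer
▪ Convolution exploits high-frequency information; attention exploits low-frequency information
▪ Uses both, like the Inception module of GoogLeNet
▪ The share of channels given to attention increases in later stages
▪ In stages 1 and 2, the attention is spatial-reduction attention
C. Si, et al., "Inception Transformer," in arXiv:2205.12956.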

32 Twins
▪ An architecture that alternates LSA and GSA
  ▪ Locally-grouped self-attention (LSA): Swin's window attention
  ▪ Global sub-sampled attention (GSA): PVT's spatial-reduction attention
X. Chu, et al., "Twins: Revisiting the Design of Spatial Attention in Vision Transformers," in Proc. of NeurIPS'21.

33 Focal Transformer (the ideal)
▪ Patches around the query are pooled at multiple resolutions to form K and V
  ▪ High resolution nearby, low resolution far away
J. Yang, et al., "Focal Self-attention for Local-Global Interactions in Vision Transformers," in Proc. of NeurIPS'21.

34 Focal Transformer (the reality)
▪ With two levels, it is essentially local plus global attention
  ▪ "For the focal self-attention layer, we introduce two levels, one for fine-grain local attention and one for coarse-grain global attention"
▪ The number of levels is generalized as L and the figure shows L=3, yet in practice only 2 levels are used...
J. Yang, et al., "Focal Self-attention for Local-Global Interactions in Vision Transformers," in Proc. of NeurIPS'21.

35 CrossFormer [1]
▪ Combines SDA (window attention) with LDA, which spatially shuffles the feature map and then applies window attention
  ▪ Spatial shuffle is also used in [2]
  ▪ Once upon a time, CNNs had a thing called ShuffleNet...
[1] W. Wang, et al., "CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention," in Proc. of ICLR'22.
[2] Z. Huang, et al., "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer," in arXiv:2106.03650.

36 Models where attention is not (the main) component, and other extras

37 Mobile-Former
▪ Places a stream of global tokens in parallel with MobileNet
  ▪ The main body is a CNN
▪ The two streams exchange information via cross-attention
[Figure: the MobileNet stream and the global-token stream, linked by cross-attention in both directions]
Y. Chen, et al., "Mobile-Former: Bridging MobileNet and Transformer," in Proc. of CVPR'22.

38 ShiftViT [1]
▪ A shift operation in place of attention
  ▪ Shifts by 1 pixel in the spatial directions (up, down, left, right)
  ▪ So it is ZERO FLOPs!!!
▪ Prior methods such as S2-MLP [2] and AS-MLP [3] exist, but ShiftViT really is shift only
[1] G. Wang, et al., "When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism," in Proc. of AAAI'22.
[2] T. Yu, et al., "S2-MLP: Spatial-Shift MLP Architecture for Vision," in Proc. of WACV'22.
[3] D. Lian, et al., "AS-MLP: An Axial Shifted MLP Architecture for Vision," in Proc. of ICLR'22.
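
A shift-only token mixer might look roughly like the sketch below. The details are assumptions made for illustration, not taken from the slide: four channel groups of size C/12 are shifted one pixel each, the remaining channels pass through unchanged, and vacated positions are filled with zeros.

```python
import numpy as np

def shift_mixer(x, div=12):
    """x: (C, H, W). Shift four channel groups by one pixel; the rest are untouched."""
    g = x.shape[0] // div
    out = np.zeros_like(x)
    out[0 * g:1 * g, :, :-1] = x[0 * g:1 * g, :, 1:]   # shift left
    out[1 * g:2 * g, :, 1:] = x[1 * g:2 * g, :, :-1]   # shift right
    out[2 * g:3 * g, :-1, :] = x[2 * g:3 * g, 1:, :]   # shift up
    out[3 * g:4 * g, 1:, :] = x[3 * g:4 * g, :-1, :]   # shift down
    out[4 * g:] = x[4 * g:]                            # remaining channels: no shift
    return out

x = np.arange(12 * 3 * 3, dtype=float).reshape(12, 3, 3)
y = shift_mixer(x)
print(y.shape)  # (12, 3, 3)
```

The function only copies memory: there are no multiplications or additions, which is where the "ZERO FLOPs" claim comes from.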

39 PoolFormer
▪ A pool operation in place of attention!
▪ (The MetaFormer paper)
W. Yu, et al., "MetaFormer is Actually What You Need for Vision," in Proc. of CVPR'22.
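
A pooling token mixer can be sketched as a stride-1 average pool over each 3×3 neighborhood. Two details here are assumptions for the sketch: padded positions are excluded from the average, and the input is subtracted from the pooled result because the block's residual connection adds it back.

```python
import numpy as np

def pool_mixer(x, k=3):
    """x: (C, H, W). Average-pool each k x k neighborhood (stride 1), minus the input."""
    c, h, w = x.shape
    p = k // 2
    padded = np.pad(x, ((0, 0), (p, p), (p, p)))  # zero padding on H and W
    ones = np.pad(np.ones((h, w)), p)             # marks which positions are real
    total = np.zeros_like(x)
    count = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            total += padded[:, dy:dy + h, dx:dx + w]
            count += ones[dy:dy + h, dx:dx + w]
    return total / count - x  # average over real neighbors only

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 14, 14))
print(pool_mixer(x).shape)  # (8, 14, 14)
```

Like the shift, this mixer has no learnable parameters; all learning happens in the norms and the FFN.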

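40 Summary
▪ Introduced recent hierarchical Vision Transformers
▪ They share a common structure built from the following modules
  ▪ Patch embedding: split the image into patches and turn them into tokens
  ▪ Positional encoding: add positional information to the tokens
  ▪ Patch merging: halve the spatial resolution and increase the number of channels
  ▪ Transformer block: feature extraction with a token mixer (attention) and an FFN
▪ The key point is reducing the cost of the attention part of the transformer block
  ▪ Approaches divide broadly into window (local) attention and spatial-reduction attention
  ▪ Combinations also exist: both in one block, or separately in consecutive blocks
▪ Dropping positional encoding and putting a DWConv in the FFN looks promising (personal opinion)
▪ The trend is to drop the cls token and use global average pooling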

41 Summary (popular at ICCV'21 and NeurIPS'21, completed at CVPR'22?)
Model Name | Paper Title | Published at | Attention Type
HaloNet | Scaling Local Self-Attention for Parameter Efficient Visual Backbones | CVPR'21 | overlapped window
Swin Transformer | Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ICCV'21 | window + shifted window
PVT | Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions | ICCV'21 | spatial reduction
MViT | Multiscale Vision Transformers | ICCV'21 | spatial reduction
CvT | CvT: Introducing Convolutions to Vision Transformers | ICCV'21 | spatial reduction
Vision Longformer | Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding | ICCV'21 | window + global token
CrossViT | CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification | ICCV'21 | global
CeiT | Incorporating Convolution Designs into Visual Transformers | ICCV'21 | global
CoaT | Co-Scale Conv-Attentional Image Transformers | ICCV'21 | factorized
ResT | ResT: An Efficient Transformer for Visual Recognition | NeurIPS'21 | spatial reduction
Twins | Twins: Revisiting the Design of Spatial Attention in Vision Transformers | NeurIPS'21 | window + spatial reduction
Focal Transformer | Focal Self-attention for Local-Global Interactions in Vision Transformers | NeurIPS'21 | window + spatial reduction
CoAtNet | CoAtNet: Marrying Convolution and Attention for All Data Sizes | NeurIPS'21 | global
SegFormer | SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | NeurIPS'21 | spatial reduction
TNT | Transformer in Transformer | NeurIPS'21 | window + spatial reduction
CrossFormer | CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention | ICLR'22 | window + shuffle
RegionViT | RegionViT: Regional-to-Local Attention for Vision Transformers | ICLR'22 | window + regional token
PoolFormer / MetaFormer | MetaFormer is Actually What You Need for Vision | CVPR'22 | pool
CSWin Transformer | CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows | CVPR'22 | cross-shaped window
Swin Transformer V2 | Swin Transformer V2: Scaling Up Capacity and Resolution | CVPR'22 | window + shifted window
MViTv2 | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | CVPR'22 | spatial reduction
Shunted Transformer | Shunted Self-Attention via Multi-Scale Token Aggregation | CVPR'22 | spatial reduction
Mobile-Former | Mobile-Former: Bridging MobileNet and Transformer | CVPR'22 | global token
Lite Vision Transformer | Lite Vision Transformer with Enhanced Self-Attention | CVPR'22 | conv attention
PVTv2 | PVTv2: Improved Baselines with Pyramid Vision Transformer | CVMJ'22 | spatial reduction

42 Appendix

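43 Relative Position Bias
▪ Self-attention itself is just an encoder of sets
  ▪ Positional encoding is what tells it that the data is a sequence
▪ Swin uses a Relative Position Bias
  ▪ Making it relative expresses translation invariance
▪ The attention strength is adjusted (learnable) according to the relative position within the window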

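44
▪ Relative positions range over [−M+1, M−1] both vertically and horizontally, giving $(2M-1)^2$ patterns
▪ The implementation stores the biases and the bias-to-index mapping, and looks the bias up when it is used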
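
The index lookup can be sketched as follows for M=7. This is a NumPy sketch of the idea only: the bias table would be a learnable parameter (one per head) in a real model, and the variable names are illustrative.

```python
import numpy as np

m = 7
# (row, col) coordinate of each of the M*M tokens in a window: shape (2, M*M)
coords = np.stack(np.meshgrid(np.arange(m), np.arange(m), indexing="ij")).reshape(2, -1)
rel = coords[:, :, None] - coords[:, None, :]  # (2, M*M, M*M), values in [-M+1, M-1]
rel = rel + (m - 1)                            # shift to [0, 2M-2]
index = rel[0] * (2 * m - 1) + rel[1]          # flatten (dy, dx) into one table index

bias_table = np.zeros((2 * m - 1) ** 2)        # learnable in practice
bias = bias_table[index]                       # (M*M, M*M), added to the attention logits
print(index.shape, index.min(), index.max())   # (49, 49) 0 168
```

All 49×49 token pairs share just $(2M-1)^2 = 169$ bias values, one per relative offset.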

45 Discussion of positional encoding
▪ On Position Embeddings in BERT, ICLR'21
  ▪ https://openreview.net/forum?id=onxoVA9FxMw
  ▪ https://twitter.com/akivajp/status/1442241252204814336
▪ Rethinking and Improving Relative Position Encoding for Vision Transformer, ICCV'21. thanks to @sasaki_ts
▪ CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, arXiv'21. thanks to @Ocha_Cocoa

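46 Conditional Positional Encoding (CPE) [1]
▪ A position encoding (PE) that depends on the input tokens, is independent of the input image size, is translation-invariant, and can loosely account for absolute coordinates as well
▪ The implementation simply reshapes the tokens back into a 2D feature map and applies a conv with zero padding
  ▪ It has been reported that zero-padded convs let CNNs retain absolute coordinates in their feature maps [2]
  ▪ Inspired by this, PVTv2 inserts a DWConv into the FFN and removes the PE
[1] X. Chu, et al., "Conditional Positional Encodings for Vision Transformers," in arXiv:2102.10882.
[2] M. Islam, et al., "How Much Position Information Do Convolutional Neural Networks Encode?," in Proc. of ICLR'20.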
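
The "reshape to 2D, then zero-padded conv" recipe can be sketched as below. The assumptions are a depthwise 3×3 conv and adding the conv output back to the tokens; `conditional_pe` is an illustrative name. The toy input shows how zero padding leaks absolute position: with constant tokens and an all-ones kernel, corners, edges, and interior positions get different values.

```python
import numpy as np

def conditional_pe(tokens, h, w, kernel):
    """tokens: (H*W, C), kernel: (C, 3, 3) depthwise weights. Returns (H*W, C)."""
    c = tokens.shape[1]
    fmap = tokens.reshape(h, w, c)                   # back to a 2D feature map
    padded = np.pad(fmap, ((1, 1), (1, 1), (0, 0)))  # zero padding
    pe = np.zeros_like(fmap)
    for dy in range(3):
        for dx in range(3):
            pe += padded[dy:dy + h, dx:dx + w, :] * kernel[:, dy, dx]
    return (fmap + pe).reshape(h * w, c)             # add the encoding to the tokens

tokens = np.ones((16, 2))
out = conditional_pe(tokens, 4, 4, np.ones((2, 3, 3)))
print(out[0, 0], out[1, 0], out[5, 0])  # 5.0 7.0 10.0 (corner, edge, interior)
```

Because the encoding is computed from the tokens by a conv, it works for any input size without interpolating a position table.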
