
[Reading group] Vision Transformer Adapter for Dense Predictions

Reading-group slides on ViT-Adapter, a method that adapts Vision Transformers to dense prediction tasks efficiently by injecting vision-specific inductive biases.

Paper: https://arxiv.org/abs/2205.08534

Naoki Kato

December 09, 2025



Transcript

  1. Paper overview: proposes ViT-Adapter, which injects vision-specific inductive biases into a Vision Transformer (ViT) so that it can be adapted efficiently to dense prediction tasks. It is used in many top-ranked competition solutions:
     • 1st Place Solution for the 5th LSVOS Challenge: Video Instance Segmentation
     • Champion solution in the 3D Occupancy Prediction track of the CVPR 2023 Autonomous Driving Challenge
     • Champion solution in the Video Scene Parsing in the Wild Challenge at CVPR
     It is also adopted by DINOv3 [Siméoni+] for the segmentation task (without the Feature Injector described later).
     Why this paper: both the problem framing and the solution are well reasoned and instructive, and the method is strong at segmentation.
  2. ViT-Adapter: attach an adapter to a pre-trained ViT and train on the downstream task.
     • A powerful ViT pre-trained on multi-modal data can be reused as-is
     • The adapter injects vision-specific inductive biases
     [Figure: (a) previous paradigm, a vision-specific transformer pre-trained on the image modality (SL/SSL) and then fine-tuned on COCO/ADE20K (Det/Seg); (b) proposed paradigm, a plain ViT pre-trained multi-modally (image, video, text, ...) and then fine-tuned together with the adapter.]
  3. Difference from existing ways of using ViTs: a feature pyramid is obtained through bidirectional information exchange with the ViT that also draws on the input image.
     [Figure: (a) previous method (Li et al., ViTDet) builds the 1/4 to 1/32 pyramid from the final ViT feature map alone by fixed up/down-sampling (up 4x, up 2x, 1x, down 2x), i.e. using only task priors; (b) ViT-Adapter (ours) produces the 1/4, 1/8, 1/16, 1/32 maps using task priors and the input image.]
     Since a ViT appears to lose spatial information already at the patch-embedding stage, this design seems reasonable (the previous method is sketched below).
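To make the contrast concrete, here is a minimal sketch of the ViTDet-style "previous method": the whole pyramid is derived from the ViT's final stride-16 feature map with fixed resampling, so nothing beyond what survives patch embedding reaches the pyramid. The layer choices (embed_dim, deconv/max-pool stages) are illustrative assumptions, not the ViTDet reference code.

```python
# Minimal sketch (assumed layer choices) of building a feature pyramid from a
# single stride-16 ViT output, as in panel (a); ViT-Adapter replaces this
# one-way construction with the bidirectional exchange sketched on the next slide.
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """Turn one (B, C, H/16, W/16) ViT feature map into 1/4, 1/8, 1/16, 1/32 maps."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.up4x = nn.Sequential(                                # 1/16 -> 1/4
            nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2),
        )
        self.up2x = nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2)  # 1/16 -> 1/8
        self.keep = nn.Identity()                                 # 1/16 -> 1/16
        self.down2x = nn.MaxPool2d(kernel_size=2, stride=2)       # 1/16 -> 1/32

    def forward(self, x):
        # Only "task priors" (fixed resampling ops) are used; the input image
        # itself is never consulted again.
        return [self.up4x(x), self.up2x(x), self.keep(x), self.down2x(x)]
```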
  4. Overall architecture: a Spatial Prior Module and Feature Injector/Extractor pairs are added to the ViT.
     • The multi-scale features from the final Extractor are used for prediction (an Injector/Extractor sketch follows below)
     [Figure: (a) Vision Transformer (ViT): patch embedding, position embedding, blocks 1..N; (b) ViT-Adapter: each group of ViT blocks is preceded by an Injector and followed by an Extractor, fed by the Spatial Prior Module, with Det/Seg heads on top; (c) Spatial Prior Module: a stem producing multi-scale features F1, F2, F3; (d) Spatial Feature Injector: cross-attention with the ViT features as query and the spatial features as key/value; (e) Multi-Scale Feature Extractor: cross-attention with the spatial features as query and the ViT features as key/value, followed by a norm and an FFN.]
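Below is a minimal sketch of one Injector/Extractor pair as described in panels (d) and (e). For readability it uses standard multi-head cross-attention where the paper uses sparse (deformable) attention, and a plain MLP stands in for the paper's convolutional FFN; dimensions, head count, and the FFN width are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Injector(nn.Module):
    """Panel (d): inject spatial-prior features into the ViT tokens."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized gate: at the start of fine-tuning the pre-trained
        # ViT features pass through unchanged.
        self.gamma = nn.Parameter(torch.zeros(dim))

    def forward(self, vit_tokens, spatial_tokens):
        # Query = ViT tokens, Key/Value = spatial-prior tokens
        q, kv = self.norm_q(vit_tokens), self.norm_kv(spatial_tokens)
        attn_out, _ = self.attn(q, kv, kv)
        return vit_tokens + self.gamma * attn_out


class Extractor(nn.Module):
    """Panel (e): pull multi-scale features back out of the ViT."""

    def __init__(self, dim: int = 768, num_heads: int = 8, ffn_ratio: float = 0.25):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        hidden = int(dim * ffn_ratio)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, spatial_tokens, vit_tokens):
        # Query = spatial-prior tokens, Key/Value = ViT tokens
        q, kv = self.norm_q(spatial_tokens), self.norm_kv(vit_tokens)
        attn_out, _ = self.attn(q, kv, kv)
        spatial_tokens = spatial_tokens + attn_out
        # Feed-forward refinement of the extracted multi-scale features
        return spatial_tokens + self.ffn(self.norm_ffn(spatial_tokens))
```

In the full model the ViT blocks are split into N groups, each preceded by an Injector and followed by an Extractor; the spatial tokens after the last Extractor are reshaped into the 1/8, 1/16, and 1/32 maps (plus an upsampled 1/4 map) that feed the detection/segmentation head. Initializing gamma to zero means the pre-trained ViT behaves exactly as it did before fine-tuning starts.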
  5. Spatial Prior Module (SPM): extracts local spatial information from the input image, separately from the patch embedding (sketched below).
     Steps:
     1. Stem: convolution layers + max pooling
     2. Stride-2 convolutions progressively shrink the resolution of the feature maps
     3. 1x1 convolutions align the channels with the ViT embedding dimension
     4. Multi-scale features are obtained
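A minimal sketch of the SPM following the four steps above; channel widths and the exact number of convolutions are illustrative assumptions rather than the paper's reference implementation. The flattened 1/8, 1/16, and 1/32 outputs are what the Injector/Extractor on the previous slide consume as spatial tokens.

```python
import torch.nn as nn

def conv_bn_relu(c_in, c_out, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class SpatialPriorModule(nn.Module):
    """Extract local multi-scale spatial features from the raw image."""

    def __init__(self, inner: int = 64, embed_dim: int = 768):
        super().__init__()
        # Step 1 -- stem: conv layers + max pooling, output stride 4
        self.stem = nn.Sequential(
            conv_bn_relu(3, inner, stride=2),
            conv_bn_relu(inner, inner),
            conv_bn_relu(inner, inner),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Step 2 -- stride-2 convs progressively shrink the resolution
        self.down8 = conv_bn_relu(inner, inner * 2, stride=2)       # 1/8
        self.down16 = conv_bn_relu(inner * 2, inner * 4, stride=2)  # 1/16
        self.down32 = conv_bn_relu(inner * 4, inner * 4, stride=2)  # 1/32
        # Step 3 -- 1x1 convs align every scale with the ViT embedding dim
        self.proj8 = nn.Conv2d(inner * 2, embed_dim, kernel_size=1)
        self.proj16 = nn.Conv2d(inner * 4, embed_dim, kernel_size=1)
        self.proj32 = nn.Conv2d(inner * 4, embed_dim, kernel_size=1)

    def forward(self, img):
        c4 = self.stem(img)      # 1/4 resolution
        c8 = self.down8(c4)      # 1/8
        c16 = self.down16(c8)    # 1/16
        c32 = self.down32(c16)   # 1/32
        # Step 4 -- multi-scale features, all with the ViT channel width
        return self.proj8(c8), self.proj16(c16), self.proj32(c32)
```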
  6. Qualitative results on ADE20K: fine details in the feature maps are better represented.
     [Figure: segmentation results and stride-4/8/16/32 feature maps, comparing ViT-B with ViT-Adapter-B.]
  7. Ablation study confirming the effectiveness of each component.
     [Figure: the ViT-Adapter architecture diagram from slide 4 shown again (Spatial Prior Module, Injector, Extractor panels).]
  8. Summary
     • Proposes a paradigm for introducing vision-specific inductive biases into ViTs
     • Makes it possible to exploit the generality of the ViT and its strong pre-trained representations
     • Designs task-adaptation modules that leave the ViT architecture unchanged
       • Spatial Prior Module: captures local spatial information
       • Feature Injector/Extractor: inject spatial information / extract global features
     • Few additional parameters, enabling efficient transfer learning
     • Achieved state-of-the-art performance on COCO and ADE20K at the time
     • Takeaway: if the problem framing and the direction of the solution are right, combining basic building blocks alone can yield strong results (though that is the hard part)