
The Evolution of AI-Based Image Recognition: Looking Back on 25 Years of Technical Change


May 23, 2025
Automotive Engineering Exposition 2025 (人とくるまのテクノロジー展2025): JSAE special lecture
Hironobu Fujiyoshi (Chubu University)


Hironobu Fujiyoshi

May 22, 2025


Transcript

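  1. Self-introduction: Hironobu Fujiyoshi
     Education:
     - Graduated from the electronics course of Ginan Technical High School
     - Graduated from the Department of Electronic Engineering, Chubu University
     - Completed the master's program, Graduate School of Chubu University
     - Completed the term of the doctoral program, Graduate School of Chubu University (Ph.D.)
     Research career:
     - Postdoctoral researcher, The Robotics Institute, Carnegie Mellon University (USA)
     - Lecturer, College of Engineering, Chubu University
     - Associate professor, College of Engineering, Chubu University
     - Visiting researcher, The Robotics Institute, Carnegie Mellon University (USA)
     - Professor, College of Engineering, Chubu University
     - Machine Perception and Robotics Group, up to the present
     External activities:
     - Director, Japan Deep Learning Association
     - Cross-appointment (DENSO)
     Video (YouTube): vol.162, Hironobu Fujiyoshi (Chubu University), "The Future That '+AI' Will Change" (「“+AI”で変わる未来」), with Teresa Ikeda (Nogizaka46) and Tom Kawada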
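  2. Evolution of image recognition technology (a quarter century)
     Timeline figure. Task rows: specific object recognition, image classification, object detection, semantic segmentation. Era labels: CNN, Vision Transformer, supervised learning, self-supervised learning, self-supervised learning / image generation / multimodal.
     Entries (task or method: key idea):
     - Pedestrian detection, HOG + SVM: histogram of oriented gradients
     - Face detection, Haar-like + AdaBoost: light/dark differences from box filters
     - Image matching, SIFT: scale-invariant keypoint detection and description
     - Image classification by class, SIFT + BoF: introduction of the bag representation
     - Handwritten digit classification, CNN: convolutional neural network
     - Pixel differences, Random Forest: texture from pixel differences
     - SURF: speed-up with integral images
     - FAST: corner detection with decision trees
     - ORB: pair selection by unsupervised learning
     - Fisher Vector: feature representation with probability density functions
     - VLAD: features of the related visual words (VW)
     - Joint HOG, CoHOG: co-occurrence representation of HOG
     - BRIEF: binary features
     - CARD: binarization of features
     - Texton: filter bank
     - CHLAC: local auto-correlation
     - Class classification, AlexNet: convolutional neural network
     - VGG, GoogLeNet: number of layers
     - ResNet: number of layers + residual connections
     - Multi-class object detection, Faster R-CNN: region proposal
     - YOLO, SSD: single shot
     - FCN: segmentation by convolution
     - PSPNet: pyramid pooling
     - M2Det: multi-level feature pyramid
     - ViT: Vision Transformer
     - SENet: excitation
     - DINO, MAE: SSL + ViT
     - SimCLR, MoCo: contrastive learning
     - SegNet: encoder-decoder
     - DeepLab: atrous convolution
     - SegFormer: ViT
     - DETR: Transformer
     - FPN: feature pyramid
     - U-Net: uses FCN
     - EfficientNet: NAS + CNN
     - SuperGlue: uses a GNN
     - CenterNet: anchor-free
     - CLIP: alignment of language and images
     - YOLO-World: open-vocabulary object detection
     - OVSeg: open-vocabulary semantic segmentation
     - DALL-E: image generation
     - SAM: foundation model
     - Gemini, LLaVA: Vision-Language model
     - Instance segmentation, Mask R-CNN: realized end-to-end
     - LoFTR: end-to-end (E2E) with a Transformer
     - EMMA: E2E autonomous driving
     - Groot N: Vision-Language-Action model
     - BERT, GPT: Large Language model
     - ABN: incorporating human knowledge
     - LIFT: uses a CNN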
  3. Evolution of image recognition technology (a quarter century): the same timeline as slide 2, with the generation of hand-crafted features highlighted.
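  4. Invariance in correspondence search
     - Rotation change: keypoint detection. Methods: Harris, FAST
     - Scale change: keypoint detection + isotropic scale estimation. Methods: SIFT, SURF
     - Projective change: keypoint detection + affine region estimation. Methods: Hessian-Affine, MSER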
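  5. SIFT (Scale-Invariant Feature Transform) [Lowe, ICCV]
     Detects feature points (keypoints) and describes their features:
     - scale and keypoint detection with DoG
     - feature description with gradient orientation histograms
     Processing steps:
     1. Scale and keypoint detection -> gains scale invariance
     2. Keypoint localization -> gains robustness to noise
     3. Orientation computation -> gains rotation invariance
     4. Feature description -> gains robustness to brightness changes
     Output: keypoint coordinates and scale, orientation, and a 128-dimensional feature vector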
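  6. SIFT: scale and keypoint detection
     - DoG (Difference-of-Gaussian): detects the scale and the keypoints from scale space
     - Figure: smoothed images and DoG images stacked along the scale axis for the first and second octaves, with scale levels labeled σ0, 2σ0, and 4σ0
     - A scale is computed for every keypoint -> invariance to scale (scale-invariant features)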
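The scale selection on the DoG slide can be sketched in one dimension. This is a minimal illustration, not Lowe's implementation. It assumes a 1-D signal, a base scale `sigma0 = 1.0`, and three DoG levels per octave, and the helper names (`smooth`, `dog_scale_at`, `box_signal`) are hypothetical. Real SIFT works on 2-D images, downsamples per octave, and searches for extrema across both space and scale.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at 3 sigma."""
    radius = int(math.ceil(3 * sigma))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Gaussian smoothing with clamped borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index at the borders
            acc += w * signal[j]
        out.append(acc)
    return out


def dog_scale_at(signal, position, sigma0=1.0, levels=13, steps_per_octave=3):
    """Return the scale whose DoG response at `position` is strongest.

    Assumed setup: scales grow geometrically, sigma_i = sigma0 * k**i with
    k = 2 ** (1 / steps_per_octave), and DoG_i = L(sigma_{i+1}) - L(sigma_i).
    """
    k = 2 ** (1 / steps_per_octave)
    sigmas = [sigma0 * k ** i for i in range(levels)]
    smoothed = [smooth(signal, s) for s in sigmas]
    responses = [abs(smoothed[i + 1][position] - smoothed[i][position])
                 for i in range(levels - 1)]
    best = max(range(len(responses)), key=responses.__getitem__)
    return sigmas[best]


def box_signal(length, center, half_width):
    """A 1-D 'blob': ones inside the box, zeros elsewhere."""
    return [1.0 if abs(i - center) <= half_width else 0.0 for i in range(length)]


if __name__ == "__main__":
    for half_width in (2, 8):
        scale = dog_scale_at(box_signal(201, 100, half_width), 100)
        print(f"half width {half_width}: selected scale {scale:.2f}")
```

A wider blob selects a larger scale, which is the property that lets each keypoint carry its own scale.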
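  7. SIFT: feature description with gradient orientation histograms
     A local image feature is described for every keypoint:
     1. Compute the orientation from a gradient orientation histogram
     2. Rotate the feature description window, sized according to the scale, to the orientation direction
     3. Split the window into 4 x 4 blocks and describe an 8-direction gradient orientation histogram per block (16 blocks x 8 directions -> 128-dimensional feature)
     This makes a description that is invariant to rotation and scale possible.
     Figure labels: orientation, feature description window, 4 x 4 split, 8-direction gradient histogram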
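The 16 blocks x 8 directions layout can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference SIFT descriptor: the function name `sift_like_descriptor` is hypothetical, the patch is assumed to be already rotated to the keypoint orientation and resampled to 16 x 16 pixels, and the Gaussian weighting, trilinear interpolation, and clipping used by real SIFT are omitted.

```python
import math


def sift_like_descriptor(patch, blocks=4, bins=8):
    """Build a (blocks * blocks * bins)-dimensional gradient histogram.

    `patch` is a square list of rows of gray values (assumed 16 x 16 here).
    """
    size = len(patch)
    cell = size // blocks
    hist = [0.0] * (blocks * blocks * bins)
    for y in range(1, size - 1):
        for x in range(1, size - 1):
            # Central differences give the gradient at (x, y).
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(dx, dy)
            angle = math.atan2(dy, dx) % (2 * math.pi)
            direction = int(angle / (2 * math.pi) * bins) % bins
            block_y = min(y // cell, blocks - 1)
            block_x = min(x // cell, blocks - 1)
            # Vote into the block's histogram, weighted by gradient magnitude.
            hist[(block_y * blocks + block_x) * bins + direction] += magnitude
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist


if __name__ == "__main__":
    ramp = [[float(x) for x in range(16)] for _ in range(16)]
    descriptor = sift_like_descriptor(ramp)
    print(len(descriptor))  # 4 * 4 blocks * 8 directions
```

Normalizing the vector to unit length is what gives some robustness to brightness scaling, since a uniform gain on the patch scales every bin equally.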
  8. Approaches after SIFT
     - Keypoint detection
       - Speed-up: SURF [ECCV]
       - Specialized for corners: FAST [ECCV]
     - Feature description
       - Speed-up: SURF [ECCV]
       - Binary features: BRIEF [ECCV], ORB [ICCV], BRISK [ICCV], CARD [ICCV]
     Reference: H. Fujiyoshi and M. Ambai, "Gradient-based Image Local Features: Approaches after SIFT", Journal of the Japan Society for Precision Engineering (精密工学会誌), Vol. 77, No. 12, 2011, p. 1109. Key words: image local feature, SIFT, SURF, FAST, RIFF, BRIEF, BRISK, ORB, CARD
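  9. Effect of binary features
     - Processing speed is compared including the distance computation
     - Advantages of binary features
       - Large memory savings are possible
       - Fast distance computation with the Hamming distance -> can be computed quickly with SSE extended instructions
     Figure: processing time [s], broken down into keypoint detection, feature description, and distance computation, for SIFT (Euclidean distance), SURF (Euclidean distance), binary features (FAST + BRIEF, Hamming distance), and binary features (FAST + ORB, Hamming distance)
     (SSE: an extended instruction set built into Intel microprocessors (CPU/MPU))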
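The two advantages listed for binary features can be made concrete with a small sketch. The names `hamming_distance` and `nearest` are hypothetical. The 256-bit descriptor length and the float32 storage of SIFT are assumptions chosen for illustration, since the slide does not state sizes. Pure Python stands in here for the XOR and population-count instructions that SSE-class hardware provides.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length binary descriptors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    # XOR marks the differing bits; counting the set bits gives the distance.
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")


def nearest(query: bytes, database: list) -> int:
    """Index of the database descriptor closest to `query` in Hamming distance."""
    return min(range(len(database)),
               key=lambda i: hamming_distance(query, database[i]))


# Memory per descriptor under the assumptions stated above.
SIFT_BYTES = 128 * 4      # 128 dimensions stored as float32
BINARY_BYTES = 256 // 8   # 256-bit binary descriptor

if __name__ == "__main__":
    database = [b"\x00\x00", b"\xff\x0f", b"\xff\xff"]
    print(nearest(b"\xff\x1f", database))
    print(SIFT_BYTES // BINARY_BYTES)  # memory ratio under these assumptions
```

Because the distance reduces to XOR plus a bit count, no floating-point arithmetic is needed during matching.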
  10. Later approaches: applying deep learning
      - LoFTR [Sun, CVPR]: learns "keypoint detection -> description -> matching" end-to-end with a deep learning model
      - Comparison with SuperGlue (figure panels: LoFTR, SuperGlue, conventional method)
      Figure (LoFTR architecture), recoverable labels:
      1. Local Feature CNN: input images I^A, I^B; coarse feature maps F̃^A, F̃^B at 1/8 resolution; fine feature maps F̂^A, F̂^B at 1/2 resolution
      2. Coarse-Level Local Feature Transform: flatten, positional encoding, LoFTR Module (Self-Attention Layer and Cross-Attention Layer, repeated N_c times), giving F̃^A_tr, F̃^B_tr
      3. Matching Module: Differentiable Matching Layer, confidence matrix P_c of size (1/8)^2 H^A W^A x (1/8)^2 H^B W^B, coarse matches M_c
      4. Coarse-to-Fine Module: for every coarse prediction (ĩ, j̃) ∈ M_c, cropping w x w windows on F̂, LoFTR Module (repeated N_f times), correlation and softmax, expectation E(·), fine matches M_f = {(î, ĵ')}
      Annotations (panel labels: Self, Cross, Feature PCA):
      - Self: feature extraction that considers the whole image, for each image
      - Cross: feature extraction based on the relationship between the images
      - Obtains the correspondences between the images
      - Does not require keypoint detection
      - Correspondence matching over the whole image is possible
  11. Evolution of image recognition technology (a quarter century): the same timeline as slide 2, with the generation of hand-crafted features highlighted.
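  12. Hand-crafted features for object detection (local)

      • Object detection task: find where in the image an object of a given category is located
        - Face detection [Viola, CVPR]: Haar-like features (hand-crafted) + AdaBoost (machine learning)
        - Pedestrian detection [Dalal, CVPR]: HOG features (hand-crafted) + SVM (machine learning)
      [Figure: a face detection example and a pedestrian detection example]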
  13. Hand-crafted features for object detection (middle level)

      • Middle-level features are obtained by combining local features through machine learning
        - Joint Haar-like [Mita, PAMI]: expresses the co-occurrence of multiple features in order to capture the structural similarity of face images
          (figure: the responses of several Haar-like features are thresholded and concatenated into one code, e.g. j = (111) = 7, shown for the positive and negative classes)
        - Joint HOG [Mitsui, IEICE] and CoHOG [Watanabe, IPSJ]: express the co-occurrence of multiple features in order to capture the structural similarity of pedestrians
      • Co-occurrence representations of HOG features capture the relationships between gradients
        - CoHOG [Watanabe]: a co-occurrence matrix that accumulates gradient pairs within a local region
        - Joint HOG [Mitsui]: boosting acquires the relationships between local regions that are effective for discrimination
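The co-occurrence code j = (111) = 7 shown for the Joint Haar-like feature can be read as a binary number built from thresholded feature responses. The following is a minimal sketch under that reading; the function name, the responses and the thresholds are illustrative assumptions, not values from the slides.

```python
def joint_haar_code(responses, thresholds):
    """Binarize each feature response by its threshold and pack the bits into one integer."""
    code = 0
    for value, threshold in zip(responses, thresholds):
        bit = 1 if value > threshold else 0
        code = (code << 1) | bit  # append the bit on the right
    return code


# Three hypothetical Haar-like responses, all above their thresholds: binary 111
code = joint_haar_code([0.8, 0.5, 0.9], [0.3, 0.2, 0.4])
print(code)
```

A boosted classifier can then treat each joint code as one discrete feature value, which is how co-occurrence enters the weak learner in this kind of design.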
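  14. Hand-crafted features for image classification (global)

      • Image classification task: classify which object category the object in the image belongs to
        - Bag-of-features (BoF), also called bag-of-visual-words (BoVW) [Csurka, CVPR]
      [Figure: image → SIFT features (hand-crafted) → bag-of-visual-words → SVM (machine learning) → "FACE" / "BIKE"]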
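The bag-of-visual-words step can be sketched as nearest-codeword counting. This is a simplified illustration: the two-dimensional descriptors and the two-word codebook are toy assumptions standing in for SIFT descriptors and a clustered codebook.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bovw_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest visual word; return a normalized histogram."""
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)), key=lambda k: squared_distance(d, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Toy codebook with two visual words and four local descriptors (hypothetical values)
codebook = [[0.0, 0.0], [1.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.0], [1.0, 0.8], [0.0, 0.2]]
histogram = bovw_histogram(descriptors, codebook)
print(histogram)
```

The resulting fixed-length histogram is what a classifier such as an SVM would receive, regardless of how many local descriptors the image produced.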
  15. Evolution of image recognition technology (a quarter century)

      2nd generation: feature representations acquired by CNNs
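  16. Feature extraction in a CNN (convolutional neural network)

      [Figure: Input image → Input layer → Conv layer → Pooling layer → Conv layer → Pooling layer → Flatten → FC layer → FC layer → Output layer; each conv layer is annotated with its number of kernels, kernel size, stride and padding]
      Repeating convolution and pooling over multiple stages aggregates the important features over a wide area, and the fully connected layers yield features that do not depend on position (local → middle → global).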
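The convolution, pooling and flatten stages can be sketched in plain Python. This is a toy illustration only: the image and the hand-set edge kernel are assumptions, whereas a real CNN learns its kernel weights.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation (stride 1, no padding), as used in CNN conv layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(fmap):
    """Element-wise max(0, x)."""
    return [[max(0.0, x) for x in row] for row in fmap]


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [[max(fmap[i + u][j + v] for u in range(size) for v in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]


# Toy 4x5 image with a vertical edge between columns 1 and 2
image = [[0, 0, 1, 1, 1] for _ in range(4)]
edge_kernel = [[-1, 1]]  # hypothetical hand-set kernel; a CNN would learn these weights

pooled = max_pool2d(relu(conv2d(image, edge_kernel)))
flat = [x for row in pooled for x in row]  # "Flatten" before the fully connected layers
print(flat)
```

Pooling keeps the edge response while halving the resolution, which is the mechanism by which stacked stages cover progressively wider image regions.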
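  17. Effect of CNN kernels (first layer) as local image features

      [Figure: HOG (gradient features over orientation bins), AlexNet (gradient-like kernels only) and AlexNet (all kernels), each shown with its feature dimensionality]
      How the features are computed:
      • HOG: compute the gradient at each pixel, build a histogram for each cell, then normalize over block regions
      • AlexNet: convolve the kernels at every pixel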
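The HOG side of this comparison (per-pixel gradient, per-cell histogram, normalization) can be sketched as follows. The choice of 9 unsigned orientation bins and central differences is an assumption made for illustration, and block normalization is reduced to a single L2 normalization.

```python
import math


def cell_orientation_histogram(cell, bins=9):
    """Unsigned gradient-orientation histogram of one cell, weighted by gradient magnitude."""
    hist = [0.0] * bins
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # central difference, horizontal
            gy = cell[y + 1][x] - cell[y - 1][x]  # central difference, vertical
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # fold into [0, 180)
            hist[int(angle // (180.0 / bins)) % bins] += magnitude
    return hist


def l2_normalize(vector, eps=1e-6):
    """L2 normalization, standing in for HOG block normalization."""
    norm = math.sqrt(sum(v * v for v in vector) + eps ** 2)
    return [v / norm for v in vector]


# Toy 4x4 cell whose intensity increases from left to right (horizontal gradient only)
cell = [[float(x) for x in range(4)] for _ in range(4)]
hist = cell_orientation_histogram(cell)
print(l2_normalize(hist))
```

A first-layer CNN kernel replaces the fixed difference filters above with learned weights, which is the comparison the slide draws.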
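  18. Effect of CNN kernels (first layer) as local image features

      • Evaluated on a pedestrian detection task (INRIA Person Dataset)
      Compared configurations (plot legend), each classified with an SVM:
        - HOG
        - AlexNet (selected kernels): only the kernels that resemble HOG
        - AlexNet (all kernels)
        - AlexNet (conv): CNN features from a convolutional layer
      Kernels acquired by learning (CNN) outperform the hand-crafted feature (HOG).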
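  19. Applying CNNs to a variety of tasks

      • The CNN structure is designed and trained to suit the task
        - Image classification: input image (W × H) → CNN → class probabilities (C), e.g. "Person"
          Representative methods: AlexNet, VGG, GoogLeNet, ResNet, SENet
        - Object detection: outputs class probabilities and detection regions for each grid cell (C + B values per cell of a W′ × H′ grid)
          Representative methods: Faster R-CNN, YOLO, SSD, FPN, M2Det, CenterNet, Mask R-CNN
        - Semantic segmentation: outputs class probabilities for each pixel (W × H × C)
          Representative methods: FCN, U-Net, SegNet, PSPNet, DeepLabv3
      [Figure legend: convolution layer, pooling layer, upsampling layer]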
  20. Incorporating top-down information (human knowledge) into CNNs

      • Attention Branch Network (ABN) [Fukui, CVPR]
        - A CNN with an added attention branch that outputs an attention map indicating the regions it attends to
      [Figure: Input image → Feature extractor (conv to ResBlock) → feature map; Attention branch (ResBlock, GAP) → attention map; the attention map is applied to the feature map (× and Σ) and passed to the Perception branch (ResBlock, output layer) → classification result]
      Equations typeset alongside the figure:
        $t \log y + (1 - t)\log(1 - y)$
        $v_i^c = \frac{1}{M \times N}\sum_{m=1}^{M}\sum_{n=1}^{N} f^c_{m,n}(x_i)$
        $g'(x_i) = g(x_i) \cdot \frac{1}{C}\sum_{c=1}^{C} f^c(x_i)$
  21. Incorporating top-down information (human knowledge) into CNNs

      • Attention Branch Network (ABN) [Fukui, CVPR]
        - The attention branch and the perception branch are fine-tuned using the norm error between the attention map and the human attention region
      [Figure: same ABN structure as the previous slide, with the human knowledge (attention region) compared against the attention map; legend: Tuned / Frozen]
      $L_{map} = \alpha \lVert x - y \rVert_2^2$
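The map loss $L_{map} = \alpha \lVert x - y \rVert_2^2$ is a scaled squared L2 distance between the network's attention map and the human-edited one. A minimal sketch, assuming both maps are same-sized 2D lists; the example values and the default alpha are illustrative.

```python
def attention_map_loss(attention_map, human_map, alpha=1.0):
    """alpha times the squared L2 norm of the difference between two equally sized maps."""
    return alpha * sum((a - h) ** 2
                       for row_a, row_h in zip(attention_map, human_map)
                       for a, h in zip(row_a, row_h))


# Hypothetical 2x2 attention map and human attention region
attention = [[0.2, 0.8], [0.5, 0.5]]
human = [[0.0, 1.0], [0.5, 0.5]]
print(attention_map_loss(attention, human, alpha=2.0))
```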
  22. Evolution of image recognition technology (a quarter century)

      3rd generation: feature representations acquired by ViT
  23. Vision Transformer (ViT) [Dosovitskiy, ICLR]

      • An image classification method that applies the Transformer to the vision domain
        - Splits the image into fixed-size patches
        - Captures the relationships between patches with self-attention
        - State of the art on class classification tasks such as ImageNet
      [Figure: the Transformer model architecture, scaled dot-product attention and multi-head attention (several attention layers running in parallel), shown next to ViT]
      Scaled dot-product attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
      How ViT works
  24. How self-attention works

      • Learns and represents the importance of, and correlations among, the pieces of information in an image
      [Figure: patches $x_1, \dots, x_5$ → Embedding (+ positional encoding, PE) → $e_1, \dots, e_5$ → $(q_i, k_i, v_i)$]
      Step 1: convert each embedding into Query, Key and Value vectors
        $q_i = W_q e_i, \quad k_i = W_k e_i, \quad v_i = W_v e_i$
  25. How self-attention works

      • Learns and represents the importance of, and correlations among, the pieces of information in an image
      Step 1: convert each embedding into Query, Key and Value vectors
      Step 2: compute the relevance between patches
        Attention weights express how much each patch attends to every other patch: the raw scores $[\alpha_{i,1}, \dots, \alpha_{i,5}]$ of patch $i$ are passed through a softmax to give $[\hat{\alpha}_{i,1}, \dots, \hat{\alpha}_{i,5}]$
        $\hat{\alpha} = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$
  26. How self-attention works

      • Learns and represents the importance of, and correlations among, the pieces of information in an image
      Step 1: convert each embedding into Query, Key and Value vectors
      Step 2: compute the relevance between patches
      Step 3: compute each output as the weighted sum of the Value vectors
        $\mathrm{Attention}(Q, K, V) = \hat{\alpha} V$
      [Figure: for each patch, the weights $\hat{\alpha}_{i,1}, \dots, \hat{\alpha}_{i,5}$ multiply $v_1, \dots, v_5$ (⊗) and are summed (⊕) into output1 to output5]
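The three steps (Query/Key/Value projection, softmax-normalized relevance, weighted sum of Values) can be sketched in plain Python for a single attention head. The two-dimensional embeddings and identity projection matrices below are toy assumptions chosen so the result is easy to inspect.

```python
import math


def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(E, Wq, Wk, Wv):
    """Single-head self-attention over a list of embeddings E."""
    Q = [matvec(Wq, e) for e in E]  # step 1: q_i = W_q e_i
    K = [matvec(Wk, e) for e in E]  #         k_i = W_k e_i
    V = [matvec(Wv, e) for e in E]  #         v_i = W_v e_i
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        alpha_hat = softmax(scores)  # step 2: attention weights for this patch
        weights.append(alpha_hat)
        # step 3: weighted sum of the Value vectors
        outputs.append([sum(a * v[j] for a, v in zip(alpha_hat, V)) for j in range(len(V[0]))])
    return outputs, weights


identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection weights
E = [[1.0, 0.0], [0.0, 1.0]]         # two toy patch embeddings
outputs, weights = self_attention(E, identity, identity, identity)
print(weights)
```

With identity projections each patch attends most strongly to itself, and every row of attention weights sums to one.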
  27. Applying ViT to a variety of tasks

      • Object detection
        - DETR [Carion, ECCV]
      • Semantic segmentation
        - SegFormer [Xie, NeurIPS]
      • Efficient ViT training through knowledge distillation
        - DeiT [Touvron, ICML]
      • Self-supervised learning of ViT
        - MAE [He, CVPR]
      [Figure, SegFormer: a hierarchical Transformer encoder (overlap patch embeddings, four Transformer blocks with efficient self-attention, Mix-FFN and overlap patch merging) extracts coarse and fine features; a lightweight All-MLP decoder fuses these multi-level features and predicts the semantic segmentation mask]
      [Figure, DeiT: a distillation token is added next to the class token and the patch tokens and interacts with them through the self-attention layers; on output, its objective is to reproduce the hard label predicted by the teacher]
      Hard-label distillation, with $y_t = \arg\max_c Z_t(c)$ the hard decision of the teacher:
        $\mathcal{L}^{\mathrm{hardDistill}}_{\mathrm{global}} = \frac{1}{2}\mathcal{L}_{CE}(\psi(Z_s), y) + \frac{1}{2}\mathcal{L}_{CE}(\psi(Z_s), y_t)$
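The hard-label distillation objective quoted in the DeiT excerpt averages two cross-entropy terms, one against the true label and one against the teacher's argmax decision. A minimal sketch on raw logits follows; it uses a single student logit vector for both terms, as in the quoted formula, and the example logits are assumptions.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits)) - logits[target]


def hard_distillation_loss(student_logits, teacher_logits, true_label):
    """0.5 * CE(student, true label) + 0.5 * CE(student, teacher's hard decision)."""
    teacher_label = max(range(len(teacher_logits)), key=lambda c: teacher_logits[c])
    return (0.5 * cross_entropy(student_logits, true_label)
            + 0.5 * cross_entropy(student_logits, teacher_label))


student = [2.0, 0.5, 0.1]  # hypothetical student logits
agreeing_teacher = [3.0, 1.0, 0.0]
disagreeing_teacher = [0.0, 3.0, 1.0]
print(hard_distillation_loss(student, agreeing_teacher, true_label=0))
print(hard_distillation_loss(student, disagreeing_teacher, true_label=0))
```

When the teacher agrees with the ground truth the loss reduces to ordinary cross-entropy; when it disagrees, the student is pulled toward both labels.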
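  28. Feature representations acquired by ViT

      • Evaluating what kind of features ViT captures [Tuli, CogSci]
        - Images generated by GAN-based style transfer are used as input
        - Models evaluated: ViT, CNN, humans
      Evaluation metric:
        fraction of shape = (decisions matching the correct shape class) / (decisions matching the correct shape class + decisions matching the correct texture class)
      [Figure: a cat image is style-transferred with elephant texture by a GAN and classified by a CNN or ViT; "cat" → the model captures shape; "elephant" → the model captures texture]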
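The shape-fraction metric can be computed directly from per-image decisions on the style-transferred (cue-conflict) images. A minimal sketch, assuming each record holds the predicted class, the shape class and the texture class; the records below are made up for illustration.

```python
def shape_fraction(decisions):
    """decisions: list of (predicted, shape_label, texture_label) for cue-conflict images."""
    shape_hits = sum(1 for pred, shape, _ in decisions if pred == shape)
    texture_hits = sum(1 for pred, _, texture in decisions if pred == texture)
    total = shape_hits + texture_hits
    return shape_hits / total if total else float("nan")


# Hypothetical decisions on cat-shaped images rendered with elephant texture
decisions = [
    ("cat", "cat", "elephant"),       # shape decision
    ("elephant", "cat", "elephant"),  # texture decision
    ("cat", "cat", "elephant"),       # shape decision
    ("dog", "cat", "elephant"),       # neither, ignored by the metric
]
print(shape_fraction(decisions))
```

Predictions that match neither cue are excluded from the denominator, so the metric only compares shape-driven against texture-driven decisions.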
  29. Feature representations acquired by ViT

      • Evaluating what kind of features ViT captures [Tuli, CogSci]
      [Plot: fraction of 'shape' decisions versus fraction of 'texture' decisions for each shape category; series: ResNet-50, AlexNet, VGG-16, GoogLeNet, ViT-B_16, ViT-L_32, Humans (avg.)]
      • Humans: emphasize shape
      • CNNs: emphasize texture
      • ViT: emphasizes shape more than CNNs do
  30. Evolution of image recognition technology (a quarter century)

      4th generation: VLM
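  31. Vision and Language Model (VLM)

      • Multimodal models that can process both visual and linguistic information
      Three families (figure columns):
        - Learning from image-text pairs (CLIP): an image encoder and a text encoder; prompts such as "a photo of {car}", "a photo of {road}", "a photo of {bike}" are compared with the image
        - Text output (LLaVA): an image encoder feeding a large language model. Q: "Please describe what you see in the image." A: "This photo shows a blue car driving down a road in the mountains."
        - Multimodal (Gemini): image, video and audio encoders plus text feed a large language model, followed by image, video and audio decoders plus text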
  32. Learning from image-text pairs: aligning images and language

      • Feature representations acquired by CLIP (Contrastive Language-Image Pre-training)
        - Contrastive learning that ties image features and language features together
        - Image-text pairs on the order of hundreds of millions are collected from the web and used for training
      [Figure: (1) Contrastive pre-training: an image encoder produces $I_1, \dots, I_N$ and a text encoder produces $T_1, \dots, T_N$ (e.g. from "Pepper the aussie pup"), giving an N × N matrix of similarities $I_i \cdot T_j$; (2) Create dataset classifier from label text; (3) Use for zero-shot prediction]
      The feature similarities (cosine similarity) are pushed toward the ideal similarity pattern, in which only the matched pairs on the diagonal are similar, using a cross-entropy (CE) loss.
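The contrastive objective can be sketched as a symmetric cross-entropy over the matrix of cosine similarities, where the i-th image should match the i-th text. The temperature value and the toy features below are assumptions for illustration; this is a simplified sketch, not CLIP's training code.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer target index."""
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits)) - logits[target]


def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric CE over the N x N similarity matrix; matched pairs lie on the diagonal."""
    I = [normalize(v) for v in image_feats]
    T = [normalize(v) for v in text_feats]
    n = len(I)
    sim = [[sum(a * b for a, b in zip(I[i], T[j])) / temperature for j in range(n)]
           for i in range(n)]
    loss_i2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n  # rows: image to text
    loss_t2i = sum(cross_entropy([sim[i][j] for i in range(n)], j)
                   for j in range(n)) / n                            # columns: text to image
    return (loss_i2t + loss_t2i) / 2


images = [[1.0, 0.0], [0.0, 1.0]]         # toy image features
matched_texts = [[1.0, 0.0], [0.0, 1.0]]  # aligned with the images
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]  # deliberately misaligned
print(clip_contrastive_loss(images, matched_texts))
print(clip_contrastive_loss(images, swapped_texts))
```

The loss is near zero when each image is most similar to its own text and large when the pairing is wrong, which is the "ideal similarity pattern" the slide refers to.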
  33. Feature representations acquired by CLIP

      • Because images and language are tied together, class classification can be solved zero-shot
        - Extract features from texts made by inserting each class name into a prompt template
        - Extract features from the image to be classified
        - Compute the cosine similarity between the image and text features → predict the class of the text with the highest similarity
      [Figure: prompt template "A photo of a {object}." filled with plane, car, dog, bird; features extracted from "A photo of a plane", "A photo of a car", "A photo of a dog" and "A photo of a bird" are compared with the image feature; the image is judged to be the "dog" class]
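The zero-shot procedure (fill the prompt template, encode, pick the most similar text) can be sketched as follows. The lookup-table text encoder and the three-dimensional features are hypothetical stand-ins for the CLIP encoders, used only to keep the example self-contained.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def zero_shot_classify(image_feature, class_names, encode_text, template="A photo of a {}."):
    """Return the class whose prompt feature is most similar to the image feature."""
    scores = {name: cosine_similarity(image_feature, encode_text(template.format(name)))
              for name in class_names}
    return max(scores, key=scores.get), scores


# Hypothetical text features standing in for a real text encoder
TOY_TEXT_FEATURES = {
    "A photo of a plane.": [1.0, 0.0, 0.0],
    "A photo of a car.": [0.0, 1.0, 0.0],
    "A photo of a dog.": [0.0, 0.0, 1.0],
}


def toy_encode_text(prompt):
    return TOY_TEXT_FEATURES[prompt]


label, scores = zero_shot_classify([0.1, 0.2, 0.9], ["plane", "car", "dog"], toy_encode_text)
print(label)
```

Changing the class list changes the classifier without any retraining, which is what makes the approach zero-shot.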
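  34. The effect of tying images to language (captions)

      • For the airplane class of CIFAR, compute the feature similarity between each image and each prompt
      • Color each image's frame according to the prompt with the highest similarity
        - Blue frame: "a photo of a airplane flying through the blue sky"
        - Red frame: "a black and white photo of a airplane"
        - Green frame: "a photo of a airplane that landed on the ground"
      → Even within a single class, the model acquires flexible features and knowledge that follow the language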
  35. YOLO-World [Cheng, CVPR]

      • Open-vocabulary image recognition
        - Recognizes not only the labels learned in advance (classification) but also unseen labels specified by arbitrary text
      • Use of CLIP (text encoder)
        - Text is embedded into a semantic space, enabling open-vocabulary object detection
      [Figure: nouns are extracted from the user's text (caption, noun phrases, categories, e.g. "A man and a woman are skiing with a dog" → man, woman, dog) and passed through the text encoder to obtain vocabulary embeddings; the input image passes through the YOLO backbone to give multi-scale image features; a vision-language PAN produces image-aware embeddings; a text contrastive head and a box head perform region-text matching on object embeddings. Training uses an online vocabulary, deployment an offline vocabulary]
  36. OVSeg [Liang, CVPR]

      • A method that can segment images for an arbitrary vocabulary
      • Composed of MaskFormer and CLIP
        - MaskFormer: outputs class-agnostic masks
        - CLIP: used as an open-vocabulary classifier for the masked regions
      [Figure: (a) CLIP is pre-trained with natural images; (b) skeleton of two-stage approaches: a mask proposal generator produces masked images, which CLIP classifies with prompts such as "A photo of a {bridge}"; example with a noun parser applied to "There are apple and orange and teapot."; example queries: golden gate, yacht, Oculus, Ukulele, saturn V, blossom]
      Segmentation is possible even for objects and concepts that were never defined.
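  37. DALL-E 2 (unCLIP) [Ramesh, arXiv]

      • A diffusion model (image generation model) that uses CLIP's image and language encoders
        - CLIP's language encoder is connected to a diffusion model (decoder) to generate images from text
        - CLIP's image encoder provides the teacher signal for the prior that converts CLIP language features into image features
      Pipeline: convert CLIP language features into CLIP image features (prior) → generate an image from the CLIP image features (decoder)
      Example prompts: "a shiba inu wearing a beret and a black turtleneck"; "a corgi's head depicted as an explosion of a nebula"
      Thanks to CLIP's zero-shot ability, unseen combinations of vocabulary and images outside the training data can also be generated.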
  38. Vision and Language Model (VLM)

      • Multimodal models that can process both visual and linguistic information
      Families: learning from image-text pairs (CLIP), text output (LLaVA), multimodal (Gemini)
  39. Text output: ViT + LLM

      • A VLM that processes visual and textual information together and generates text
        - Self-attention: represents features based on the relationships between image patches (input tokens)
        - Cross-attention: acquires features based on the relationship between the image's output tokens and the question text
      [Figure: image input → Image Encoder (ViT: linear projection of flattened patches, patch + position embedding, Transformer encoder with multi-head attention, norm and MLP layers) → output tokens → Text Decoder (LLM); text input: question (prompt); text output: answer]
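  40. Main VLM tasks and examples

      • Visual Question Answering (VQA)
        - Given a question about an image in natural language, generate the answer
          (e.g.) Question: "What is the dog doing in this image?" → Output: "It is chasing a ball."
      • Image Captioning
        - Take an image as input and describe its content concisely in a sentence
          (e.g.) Input: an image of a boy looking at a water wheel through a magnifying glass → Output: "A boy is observing a water wheel."
      • Text-to-Image Generation
        - Generate the corresponding image from scratch, based on a textual description of objects and scenes
          (e.g.) Text: "A dog wearing a colorful suit in front of the ocean" → Output: the corresponding image
      • Referring Expression Comprehension
        - Identify the target object in an image based on a referring description
          (e.g.) Text: "The dog to the right of the person wearing white clothes" → Output: a bounding box around the specified dog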
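  41. Development of large language models (LLMs)

      • LLM: pre-training on large-scale text data makes it possible to adapt to a wide range of tasks
        - Capable of context-aware language understanding and generation
        - General-purpose models based on the Transformer have appeared
      [Figure: Transformer [Vaswani, NIPS] → GPT series, …; BERT [Devlin, ACL], RoBERTa [Liu, ICLR], ALBERT [Lan, ICLR], DeBERTa [He, arXiv]; LLaMA [Touvron, arXiv], Vicuna [Peng, arXiv], KOALA [Peng, NeurIPS]]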
  42. LLaVA [Liu, CVPR]

      • A VLM that uses CLIP's image encoder and a LLaMA-based language decoder (LLM)
        - Introduces a vision-language connector that bridges image features and language features
        - Low computational cost; can be trained in a short time on A100-class GPUs
      [Figure: image encoder (CLIP ViT-L) → vision-language connector (MLP) → language decoder (LLaMA); the text prompt (e.g. "User: what is unusual about this image?") enters through the tokenizer and embedding; the decoder produces the text output]
      Example:
        User: "If there are factual errors in the questions, point it out; if not, proceed to answering the question. What's happening in the desert?"
        LLaVA: "There are no deserts in the image. The image features a beach with palm trees, a city skyline, and a large body of water."
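  43. Training LLaVA

      • A two-stage training process adapts the model step by step
        - Stage 1: image-language alignment (pre-training)
          · Image-text pairs are used to align the visual features with the LLM's word embedding space
          · Only the vision-language connector (MLP) is trained
        - Stage 2: learning to respond to image-grounded user instructions (instruction tuning)
          · Instruction data on the order of tens of thousands of samples or more is used
          · Everything except the image encoder is trained
      [Figure: image encoder (CLIP ViT-L) → vision-language connector (MLP) → language decoder (LLaMA), with tokenizer and embedding for the prompt "User: what is unusual about this image?"; legend: Tuned / Frozen]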
  44. Training LLaVA

      • Same two-stage process: alignment pre-training (only the connector is trained), then instruction tuning (everything except the image encoder is trained)
      → CLIP's image encoder is reused, and the image features are adapted to the language space
  45. Vision and Language Model (VLM)

      • Multimodal models that can process both visual and linguistic information
      Families: learning from image-text pairs (CLIP), text output (LLaVA), multimodal (Gemini)
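  46. Multimodal: input and output across many modalities

      • Gemini [Anil, arXiv]
        - A multimodal LLM (MLLM) developed by Google that spans images, audio, video and text
        - Beyond image understanding, it has evolved toward audio understanding, video analysis, code generation and multilingual support
      Each modality is processed by a dedicated encoder and integrated into a shared Transformer decoder for inference.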
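  47. Model structure of EMMA

      • Uses Gemini, the MLLM developed by Google [Hwang, arXiv]
        - Gemini: an autoregressive model (Transformer decoder) that takes images and text as input and outputs text
      • EMMA is built by casting driving tasks into Gemini's input-output relationship
        - Input: surrounding information (images), driving navigation (text), past ego-vehicle information (text)
        - Output: outputs for the driving task (text)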
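  48. Planning with EMMA: chain-of-thought prompting

      • The model reasons through steps R1 to R4 before generating the future action (waypoints)
        - Scene (R1): a description covering weather, time of day, traffic situation, road conditions and so on
          (e.g.) "The weather is clear and it is daytime. The road is a street without lane separation, with a crosswalk in the center. Cars are parked on both sides of the street."
        - Critical objects (R2): identify objects that may affect the ego vehicle's driving (expressed in 3D/BEV coordinates)
          (e.g.) "A pedestrian is at [ , ]. A vehicle is at [ , ]."
        - Behavior of critical objects (R3): describe the current state and intent of the critical objects
          (e.g.) "The pedestrian is currently standing on the sidewalk looking at the road, and may be about to cross."
        - Driving decision (R4): summarize the driving plan based on the observations (the driving decision is classified into categories)
          (e.g.) "The current low speed should be maintained."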
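  49. Planning with EMMA

      [Figure: for each scene, panels show 3D object detection, road graph and planning]
      • Garbage bag → plans a path that avoids the garbage bag (an object that is not a detection target)
      • Squirrel → slows down so as not to hit the squirrel (an object that is not a detection target)
      • Traffic light → slows down because the light is yellow
      • Traffic light → stops because the light is red
      Generates safe and efficient plans across diverse driving scenarios.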
  50. Robot motion generation with VLA models: π0 (Physical Intelligence)

      • Vision-Language-Action (VLA) model
        - A Vision-Language-Action model that starts from a pre-trained VLM
        - Trained on a proprietary dataset of dexterous tasks collected from several different kinds of robots
      [Figure: "π0: A Vision-Language-Action Flow Model for General Robot Control", Physical Intelligence, https://physicalintelligence.company/blog/pi0. Fig. 1: the generalist robot policy uses a pre-trained vision-language model (VLM) backbone and a diverse cross-embodiment dataset with a variety of dexterous manipulation tasks. The model is adapted to robot control by adding a separate action expert that produces continuous actions via flow matching, enabling precise and fluent manipulation skills.]
      (The slide also links a YouTube video.)
  51. Robot motion generation with VLA models: GR00T N1 (Nvidia)
      • The first open foundation model for general-purpose humanoid robots
        - Covers both reactive intelligence (System 1) and deliberative intelligence (System 2).
      Paper: GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
      Figure 2: GR00T N1 Model Overview. Our model is a Vision-Language-Action (VLA) model that adopts a dual-system design. We convert the image observation and language instruction into a sequence of tokens to be processed by the Vision-Language Model (VLM) backbone. The VLM outputs, together with robot state and action encodings, are passed to the Diffusion Transformer module to generate motor actions.
      [Figure 2 labels: Image Observation; Language Instruction ("Pick up the industry object and place in yellow bin."); Robot State (Joint Positions, Joint Velocities, Base Position, EEF Poses, ...); Tokenize; Encode; Image Tokens; Text Tokens; Action Tokens; System 2: Vision-Language Model; System 1: Diffusion Transformer; Denoising; Motor Action]
      [Diagram: the real world connects to a perception system, a language and knowledge system, and a motor system: (1) understand the situation, (2) judge and reason, (3) act. System 1 is reactive intelligence and System 2 is deliberative intelligence. Conventional deep learning: equivalent to System 1. Current foundation models: cover up to part of System 2. Next-generation AI models: cover Systems 1 and 2.]
      Source of the diagram: a MEXT document at www.mext.go.jp (the full URL is garbled in the transcript).
  52. Robot motion generation with VLA models: Gemini Robotics (Google)
      • A VLA model obtained by fine-tuning Gemini to predict robot actions
        - A Vision-Language-Action (VLA) backbone hosted in the cloud
        - A local action decoder that runs on the robot's onboard computer
      Paper: Gemini Robotics: Bringing AI into the Physical World
      Figure 1 | Overview of the Gemini Robotics family of embodied AI models. Gemini 2.0 already exhibits capabilities relevant to robotics such as semantic safety understanding and long contexts. The robotics-specific training and the optional specialization processes enable the Gemini Robotics models to exhibit a variety of ...
      Video: YouTube (the URL is garbled in the transcript).
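  53. The evolution of AI-based image recognition and what comes next: toward the next generation
      Further evolution of multimodal integration
        - "Multimodal recognition", in which vision is integrated with non-visual information such as text, audio, video, and sensor data, becomes mainstream.
        - Starting from VLMs (Vision-Language models), capabilities evolve toward situation understanding, intent understanding, and causal reasoning (e.g., AI that infers not only what an image shows but also "why it looks that way").
      Fusion with world models (World Model + Vision)
        - Integration of Vision Transformers with reinforcement learning and world models.
        - The ability of AI to predict and simulate future states based on "what it has seen" (e.g., DINO-WM is a pioneering example).
      Expansion of spatial and temporal scales
        - From still images to video to long-term time series (e.g., vision AI that predicts tens of seconds or several minutes ahead).
        - Spatial recognition through the fusion of technologies such as 3D vision, NeRF, and Gaussian Splatting.
      Ensuring safety in VLA (E2E) models
        - Strong semantic understanding of physical safety.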
  54. References
      • 1st generation: handcrafted features
        - [Schroff, BMVC] Object Class Segmentation using Random Forests
        - [Viola, IEEE] Rapid object detection using a boosted cascade of simple features
        - [Leung, IJCV] Representing and Recognizing the Visual Appearance of Materials using Three-dimensional Textons
        - [Kobayashi, IEEE] Action and Simultaneous Multiple-Person Identification Using Cubic Higher-order Local Auto-Correlation
        - [Dalal, IEEE] Histograms of oriented gradients for human detection
        - [Watanabe, IPSJ] Co-occurrence Histograms of Oriented Gradients for Human Detection
  55. References
      • 2nd generation: feature representation learning with CNNs
        - [Ronneberger, MICCAI] U-Net: Convolutional Networks for Biomedical Image Segmentation
        - [Long, CVPR] Fully Convolutional Networks for Semantic Segmentation
        - [Badrinarayanan, CVPR] SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling
        - [Zhao, CVPR] Pyramid Scene Parsing Network
        - [Chen, ECCV] Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
        - [Ren, NIPS] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
        - [Redmon, CVPR] You Only Look Once: Unified, Real-Time Object Detection
        - [Liu, ECCV] SSD: Single Shot MultiBox Detector
        - [Lin, CVPR] Feature Pyramid Networks for Object Detection
        - [Zhou, arXiv] Objects as Points
        - [Zhao, AAAI] M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network
        - [He, ICCV] Mask R-CNN
        - [Krizhevsky, NIPS] ImageNet classification with deep convolutional neural networks
        - [Simonyan, ICLR] Very Deep Convolutional Networks for Large-Scale Image Recognition
        - [He, ILSVRC] Deep Residual Learning for Image Recognition
        - [Hu, CVPR] Squeeze-and-Excitation Networks
        - [Tan, PMLR] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
        - [Szegedy, CVPR] Going deeper with convolutions
        - [Sarlin, CVPR] SuperGlue: Learning Feature Matching with Graph Neural Networks
  56. References
      • 3rd generation: feature representation learning with ViT
        - [Dosovitskiy, ICLR] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
        - [Carion, ECCV] End-to-End Object Detection with Transformers
        - [Xie, NeurIPS] SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
        - [Caron, ICCV] Emerging Properties in Self-Supervised Vision Transformers
        - [Radford, PMLR] Learning Transferable Visual Models From Natural Language Supervision
      • 4th generation: VLM
        - [Anil, arXiv] Gemini: A Family of Highly Capable Multimodal Models
        - [Liu, NeurIPS] Visual Instruction Tuning
        - [Ramesh, arXiv] Hierarchical Text-Conditional Image Generation with CLIP Latents
        - [Hwang, arXiv] EMMA: End-to-End Multimodal Model for Autonomous Driving
        - [Bjorck, arXiv] GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
        - [Kirillov, ICCV] Segment Anything
        - [Liang, CVPR] Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
        - [Cheng, CVPR] YOLO-World: Real-Time Open-Vocabulary Object Detection
        - [Devlin, ACL] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
        - [Radford] Improving Language Understanding by Generative Pre-Training
        - [He, CVPR] Momentum Contrast for Unsupervised Visual Representation Learning