Paper Introduction / Connecting Vision and Language with Video Localized Narratives

Presented at the 59th Computer Vision Study Group @ Kanto, CVPR 2023 Reading Session (Part 1):
"Connecting Vision and Language with Video Localized Narratives" [Voigtlaender et al., CVPR 2023]

◆ Event details URL:
◆ PDF of the paper

(2023/09/02: The slides were replaced with a corrected version that fixes typos and omissions.)

Yusuke Mori

July 23, 2023

  1. Connecting Vision and Language with Video Localized Narratives

    shade-tree (Twitter: @shade_tree2112)
    [PDF] [Project Page of the paper] [These slides]
    The 59th Computer Vision Study Group @ Kanto, CVPR 2023 Reading Session, 2023/07/23
  2. Thoughts while choosing the paper to present

    • "The Award Candidate 'Ego-Body Pose Estimation via Ego-Head Pose Estimation' is one I already presented last time..." (shameless plug)
    • Ego-Body Pose Estimation via Ego-Head Pose Estimation, shade-tree (Twitter: @shade_tree2112), [Project Page], The 58th Computer Vision Study Group @ Kanto, "Deep Learning + 3D Paper Reading Session", 2023/04/30
  3. Thoughts while choosing the paper to present

    • CVPR 2023 Accepted Papers
      • https://cvpr2023.thecvf.com/Conferences/2023/AcceptedPapers
      • A list that includes information such as award candidates and highlights is published here
      • https://docs.google.com/spreadsheets/d/1OAUf7sQfJ6cSU4BiOtyl-t4dMm4iFqdEDHCSs7R2jZo
    • Searching this list
      • Search for "Highlight", and for keywords such as "Story", "Stories", "Narrative"
    • "Narratives" in "Vision and Language"! Let's go with this one!
  4. The presenter's position and viewpoint

    • Papers I have previously presented at the CV study group:
      • "A Hierarchical Approach for Generating Descriptive Image Paragraphs" [Krause et al., CVPR 2017]
      • "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" [Dosovitskiy et al., ICLR 2021]
      • "Transitional Adaptation of Pretrained Models for Visual Storytelling" [Yu et al., CVPR 2021]
      • "Panoptic Narrative Grounding" [González et al., ICCV 2021]
      • "GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation" [Xu et al., ICLR 2022]
      • "It is Okay to Not be Okay: Overcoming Emotional Bias in Affective Image Captioning by Contrastive Data Collection" [Mohamed et al., CVPR 2022]
      • "Ego-Body Pose Estimation via Ego-Head Pose Estimation" [Li et al., CVPR 2023]
    • The point of this page: (my own talks are clumsy, but) a wide variety of papers are presented at the CV study group, and the archived talks are a useful reference!
  5. Papers especially related to this one

    • Papers I have previously presented at the CV study group:
      • "A Hierarchical Approach for Generating Descriptive Image Paragraphs" [Krause et al., CVPR 2017]
      • "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" [Dosovitskiy et al., ICLR 2021]
      • "Transitional Adaptation of Pretrained Models for Visual Storytelling" [Yu et al., CVPR 2021]
      • "Panoptic Narrative Grounding" [González et al., ICCV 2021]
      • "GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation" [Xu et al., ICLR 2022]
      • "It is Okay to Not be Okay: Overcoming Emotional Bias in Affective Image Captioning by Contrastive Data Collection" [Mohamed et al., CVPR 2022]
      • "Ego-Body Pose Estimation via Ego-Head Pose Estimation" [Li et al., CVPR 2023]
  6. The lineage of Localized Narratives

    • Localized Narratives [Pont-Tuset et al., ECCV 2020]
    • Panoptic Narrative Grounding [González et al., ICCV 2021]
      • Presented previously
    • Video Localized Narratives [Voigtlaender et al., CVPR 2023]
      • Presented this time
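  7. Localized Narratives [Pont-Tuset et al., ECCV 2020]

    • Proposes a form of multimodal image annotation
    • Annotators describe the image by voice while simultaneously hovering the mouse over the region they are attending to
    • Because the voice and the mouse movement are synchronized, each word can be associated with a location in the image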
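The word-to-location association described on this slide can be sketched as a simple timestamp join. This is a minimal illustration, not the authors' pipeline: the `Word` and `TracePoint` types, the half-open time window, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # time the word starts being spoken, in seconds
    end: float    # time the word stops being spoken, in seconds


@dataclass
class TracePoint:
    t: float  # timestamp of the mouse sample, in seconds
    x: float  # normalized image coordinates (hypothetical convention)
    y: float


def trace_for_word(word, trace):
    """Return the mouse positions sampled while `word` was being spoken.

    Uses the half-open interval [start, end) so that a sample on a word
    boundary is assigned to exactly one word.
    """
    return [(p.x, p.y) for p in trace if word.start <= p.t < word.end]


# Hypothetical transcription with word timings and a hypothetical mouse trace.
words = [Word("a", 0.0, 0.2), Word("dog", 0.2, 0.6)]
trace = [
    TracePoint(0.1, 0.30, 0.40),
    TracePoint(0.3, 0.55, 0.60),
    TracePoint(0.5, 0.58, 0.62),
]

localized = {w.text: trace_for_word(w, trace) for w in words}
print(localized)
```

Each word ends up with the trace segment drawn while it was spoken, which is the sense in which the narrative is "localized".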
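  8. [Review] Panoptic Narrative Grounding (1)

    • The name of the task proposed in that paper
    • Let's look at it a little more closely, word by word
    • What is "Panoptic"?
      • Panorama-like: able to take in the whole at a glance
      • pan (= all) + optic
      • Uses Panoptic Segmentation [Kirillov et al., CVPR 2019]
    • What is "Narrative"?
      • Sometimes used as a synonym of "Story", but when the two are distinguished, it is the broader concept. If the translations are to be kept distinct, "Narrative" corresponds to 「語り」 (the act of telling)
      • For example, this preamble may not be a Story, but it can be called a Narrative
      • Localized Narratives [Pont-Tuset et al., ECCV 2020] is an important piece of prior work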
  9. [Review] Panoptic Narrative Grounding (2)

    • What is "Grounding"?
      • Here it is used in the same sense as Symbol Grounding
      • Because the grounding uses images, the term Visual Grounding appears frequently in the paper
      • Symbol Grounding was proposed by Harnad [1990] [link]
      • This paper describes the "symbol grounding problem": How can the semantic interpretation of a formal symbol system be made intrinsic to the system, rather than just parasitic on the meanings in our heads? How can the meanings of the meaningless symbol tokens, manipulated solely on the basis of their (arbitrary) shapes, be grounded in anything but other meaningless symbols?
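  10. [Review] Panoptic Narrative Grounding (3)

    • Roughly speaking, it can be understood as follows:
      • onto an image in which the whole scene can be taken in at a glance,
      • a Narrative, rather than single words,
      • is grounded (anchored, put into correspondence)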
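  11. [Review] Panoptic Narrative Grounding (4)

    • Frameworks for grounding language through its pairing with images (Visual Grounding) already existed, but their granularity is a problem
    • Datasets with detailed annotation on the language side exist, but the image-side annotation is still sparse and coarse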
  12. [Review] Panoptic Narrative Grounding (5)

    • The proposal of Panoptic Narrative Grounding
      • Uses panoptic segmentation regions as the visual grounding and newly formulates the natural language visual grounding problem
      • Proposes a task, a dataset, metrics, and a method: the full package!
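  13. Overview of the paper (1)

    • Proposes Video Localized Narratives (VidLN)
      • Localized Narratives on videos rather than still images
    • Annotating videos makes it possible to capture a whole story
      • There is a flow of events
      • Multiple people and objects appear and interact with each other
    • However, annotation is harder than for still images
      • For annotators, it is a race against time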
  14. Structure of the paper

    • Abstract
    • 1. Introduction
    • 2. Related Work
      • The paragraph "Tasks Related to VNG." discusses the differences from Panoptic Narrative Grounding
    • 3. Video Localized Narrative Annotations
      • How the annotation is done
    • 4. Video Narrative Grounding (VNG)
      • Application as a benchmark: example 1
    • 5. Video Question Answering (VideoQA)
      • Application as a benchmark: example 2
  15. Differences from Panoptic Narrative Grounding

    • Panoptic Narrative Grounding
      • Performs segmentation that covers the whole image
      • Grounds the nouns that appear in captions of still images
    • Video Narrative Grounding (VNG, one of the tasks addressed in this paper)
      • Targets videos, and targets concrete objects
  16. The five-step annotation protocol

    ① Understand the Video
    ② Actor Selection
    ③ Key-frame Selection
    ④ A Story for each Actor
    ⑤ Transcription and Time Alignment
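  17. The five-step annotation protocol: ④ Speak and move mouse

    • The main part of the annotation
    • For each actor individually, the annotator describes the selected key-frames orally
      • The actor's name, attributes, actions, and the other objects it interacts with
    • While speaking, the annotator moves the mouse pointer to indicate the objects and actions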
  18. Statistics

    • Annotations were added to three datasets
      • OVIS [Qi et al., IJCV 2022], UVO [Wang et al., ICCV 2021], Oops [Epstein et al., CVPR 2020]
    • Total
      • 20K videos
      • 1.65M words
      • 3.54 actors / video
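  19. Statistics: comparison with related datasets (an open question)

    • For actors / narrative, the maximum over the three datasets is 3.45, yet the value for "All" is 3.54?
    • 71,976 / 22,091 = 3.26, which does not match either
    • Am I overlooking a supplementary note in the paper?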
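The arithmetic behind this question can be checked directly. The sketch below assumes, as the slide's division suggests, that 71,976 is the total number of actors and 22,091 the total number of narratives. The per-dataset counts in the second half are hypothetical values chosen only to illustrate why a pooled ratio cannot exceed the largest per-dataset ratio.

```python
# Totals quoted on the slide (their interpretation as actors and narratives
# is an assumption based on the "actors / narrative" label).
total_actors = 71_976
total_narratives = 22_091

overall_ratio = total_actors / total_narratives
print(round(overall_ratio, 2))

# A pooled ratio is a weighted average of the per-group ratios, so it must lie
# between the smallest and the largest of them.
# Hypothetical (actors, narratives) counts for three datasets:
groups = [(300, 100), (690, 200), (330, 100)]

ratios = [actors / narratives for actors, narratives in groups]
pooled = sum(a for a, _ in groups) / sum(n for _, n in groups)

print(ratios, pooled)
print(min(ratios) <= pooled <= max(ratios))
```

Under this reading, an overall value of 3.54 above a per-dataset maximum of 3.45 cannot come from simple pooling, which is why the slide suspects a differing definition or an overlooked note.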
  20. Statistics: checking the richness of the captions

    • Richer captions than ActivityNet-Entities [Zhou et al., CVPR 2019]
    • Longer (75.1 words on average), with more words in every part-of-speech category as well
      • 23.0 nouns
      • 9.5 verbs
      • 8.5 adjectives
      • 7.2 adpositions
      • 2.4 pronouns
  21. Statistics: accuracy of the captions

    • Checks both Semantic Accuracy and Localization Accuracy
    • Semantic Accuracy (upper table)
      • 70 videos were selected at random and their noun phrases and verbs checked → nearly perfect
    • Localization Accuracy
      • Higher accuracy (+25%) even compared with the still-image Localized Narratives (ImLN) dataset → the proposed protocol is useful
  22. VNG: method

    • ReferFormer-VNG, an improved version of ReferFormer [Wu et al., CVPR 2022]
    • The original method
      • Designed for the Referring Video Object Segmentation (R-VOS) task
        • Short descriptions, a single object
      • A Visual Encoder extracts features from the video
      • A Text Encoder extracts features from the text
      • These features are fed to a Decoder, which generates a mask for each frame
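  23. VNG: method

    • In VNG, several objects can be referred to by the same noun
    • The processing in the Text Encoder is changed
      • ReferFormer: per-token features + a feature for the whole text
        • Useful there, because the whole text refers to a single object
      • ReferFormer-VNG: the average-pool of the features of the tokens that correspond to the noun to be segmented
        • (Not verified) whether both the per-token features and the average-pool are used, or only the latter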
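The pooling change described on this slide can be sketched in a few lines. This is only an illustration of average-pooling over the token positions of one noun, not the ReferFormer-VNG implementation: the feature values, the subword split, and the use of a plain mean over all tokens as a stand-in for a whole-text feature are assumptions.

```python
def average_pool(token_features, token_indices):
    """Return the element-wise mean of the feature vectors at `token_indices`."""
    if not token_indices:
        raise ValueError("at least one token index is required")
    selected = [token_features[i] for i in token_indices]
    dim = len(selected[0])
    return [sum(vec[d] for vec in selected) / len(selected) for d in range(dim)]


# Hypothetical subword tokens and 2-dimensional token features.
tokens = ["the", "para", "##keet", "eats"]
features = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [4.0, 2.0]]

# The noun to segment ("parakeet") covers token positions 1 and 2.
noun_feature = average_pool(features, [1, 2])

# Stand-in for a whole-text feature: the mean over every token (an assumption).
sentence_feature = average_pool(features, list(range(len(tokens))))

print(noun_feature, sentence_feature)
```

When a narrative mentions several objects, the noun-level feature differs per target while the whole-text feature is identical for all of them, which is the motivation the slide gives for the change.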
  24. VNG: evaluation metric and baseline results

    • Evaluation metric: 𝒥&ℱ [Pont-Tuset et al., 2018]
      • The one from "The 2017 DAVIS Challenge on Video Object Segmentation"
    • Baseline: ReferFormer
    • Given the nature of the VNG task, feeding only the noun works slightly better than feeding the whole narrative
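In the DAVIS metric, the region similarity 𝒥 is the Jaccard index (intersection over union) between the predicted and ground-truth masks, and 𝒥&ℱ averages it with the contour accuracy ℱ. The sketch below computes only 𝒥 for a single frame on small hand-written masks. Treating two empty masks as a perfect match is a convention assumed here, and ℱ is not covered.

```python
def jaccard(pred_mask, gt_mask):
    """Intersection over union of two binary masks given as lists of rows."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    if union == 0:
        return 1.0
    return intersection / union


# Hypothetical 2x3 masks for one frame.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]

print(jaccard(pred, gt))
```

A per-video score would average this value over the annotated frames of each object.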
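  25. VideoQA

    • Text-output Questions
      • Free-form text answers
      • Note: skipped in the talk because of time constraints (the slides are in the Appendix)
    • Location-output Questions
      • Questions starting with "where is" are answered by localizing the target in time and space within the video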
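  26. VideoQA: creating the Location-output Q&A

    • Automatic generation
      • spaCy is used to add part-of-speech tags and parse trees
      • Sentences are converted into questions of the form "where is the subject"
      • 3.7 questions / video
      • The ground truth is built from the mouse trace information
    • Human verification
      • Two people check each question; only those approved by both are kept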
  27. VideoQA: Location-output results

    • ReferFormer-VNG is used as the baseline
      • The words "where is" are removed from the question, and ReferFormer-VNG outputs a segmentation mask for the first noun in the sentence
      • The mask is converted into a bounding box
    • Results
      • Recall: 66.7%
      • Precision: 53.9%
    • Good results, but far from perfect. The proposed benchmark is worthwhile
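Two mechanical steps of this baseline can be sketched as follows: stripping the "where is" prefix from the question, and converting a binary mask into a bounding box. Both helpers are hypothetical illustrations, not the authors' code. Selecting the first noun needs a part-of-speech tagger and producing the mask needs the segmentation model, so neither is shown.

```python
def strip_where_is(question):
    """Remove a leading "where is" and the trailing question mark."""
    text = question.strip()
    prefix = "where is"
    if text.lower().startswith(prefix):
        text = text[len(prefix):]
    return text.strip(" ?")


def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels, or None."""
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None  # empty mask: no box can be derived
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical question and a hypothetical 3x4 mask for one frame.
phrase = strip_where_is("Where is the dog that jumps over the fence?")
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0]]

print(phrase)
print(mask_to_bbox(mask))
```

The box uses inclusive pixel coordinates, so a single-pixel mask yields a box whose minimum and maximum corners coincide.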
  28. A little more on Story and Narrative

    • According to the Oxford Learner's Dictionary of Academic English [2014]:
      • Story
        1. a description, often spoken, of what happened to sb or of how sth happened
      • Narrative
        1. [C] a description of events, especially in a novel [synonym] story (1)
        2. [U] the act, process or skill of telling a story
    • What is the "Narrative" element of Localized Narratives?
      • It uses the temporal sequence (of where the annotator is attending)
      • This time the data is video in the first place, so the Narrative as a whole can be handled
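  29. VideoQA: creating the Text-output Q&A

    • Automatic generation
      • Based on VQ2A [Changpinyo et al., NAACL 2022]
      • Automatic post-processing, such as removing questions with duplicate answers
    • Human verification, removing questions such as:
      • "What color is the sky?"
        • Can be answered without watching the video
      • "What color is the cat?"
        • The answer is ambiguous if several cats appear in the video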
  30. VideoQA: Text-output results

    • PaLI [Chen et al., ICLR 2023] is used as the baseline
      • PaLI-1.5B
    • Evaluated by exact match against the ground truth
    • Results using a single frame
      • Zero-shot: 24.1%
      • Fine-tuned (on Oops-QA): 44.9%
    • The proposed benchmark is complex, and there is room for methods to improve
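Exact-match evaluation, as named on this slide, can be sketched as below. The normalization (lowercasing and trimming whitespace) is an assumption made for the example, since the slide does not state how answers are normalized, and the sample answers are hypothetical.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    def normalize(text):
        # Assumed normalization: ignore case and surrounding whitespace.
        return text.strip().lower()

    hits = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return hits / len(references)


# Hypothetical model answers and ground-truth answers.
predictions = ["white", "a dog", "Red "]
references = ["white", "dog", "red"]

accuracy = exact_match_accuracy(predictions, references)
print(f"{accuracy:.1%}")
```

Exact match gives no credit for near misses such as "a dog" versus "dog", which is one reason scores on free-form answers can look low.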