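Method using a language model (image captioning model)
Visual and action information, captioning model, instruction text [3]
Text-based object tracking in egocentric video
“the large white bowl with broccoli inside that is used to load the pan of broccol”
ScanQA (CVPR2022). RefEgo (ICCV2023). Generative Language Grounded Policy (ICLR2021).
• Document processing
• Figure and chart reading
• Document question answering
• OCR
• Real-world recognition
• Referring expression comprehension (visual grounding)
• Egocentric video understanding
• Robotics applications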
the hands of the person.
A red crate on the flat shopping cart in the middle of the isle.
A small blue plate of broccoli to left of other plate.
The red container near the wall, behind the two trays.
Scene labels: Garage, Kitchen, Lab, Supermarket
• We constructed an object localization and tracking dataset on Ego4D.
• 12,038 annotated clips, 41 hours in total.
• Bounding boxes are annotated at 2 FPS, with two textual referring expressions for each object.
• Objects can be out of frame (no-referred-object).
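The annotation layout described above can be sketched as a small data structure. This is a minimal illustration, not the dataset's actual schema: the class names, field names, the `(x, y, width, height)` box format, and the `make_clip` helper are all hypothetical. Only the 2 FPS rate, the two expressions per object, the out-of-frame case, and the clip and hour totals come from the slide.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

ANNOTATION_FPS = 2  # from the slide: boxes are annotated at 2 FPS

# Hypothetical box format (an assumption): (x, y, width, height) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class FrameAnnotation:
    time_sec: float
    box: Optional[Box]  # None marks a no-referred-object (out-of-frame) frame


@dataclass
class ClipAnnotation:
    clip_id: str                  # hypothetical identifier
    expressions: Tuple[str, str]  # two referring expressions for one object
    frames: List[FrameAnnotation]

    def visible_frames(self) -> List[FrameAnnotation]:
        """Return only the frames in which the referred object has a box."""
        return [f for f in self.frames if f.box is not None]


def make_clip(clip_id: str,
              expressions: Tuple[str, str],
              boxes: List[Optional[Box]]) -> ClipAnnotation:
    """Attach timestamps to per-frame boxes sampled at ANNOTATION_FPS."""
    frames = [FrameAnnotation(time_sec=i / ANNOTATION_FPS, box=b)
              for i, b in enumerate(boxes)]
    return ClipAnnotation(clip_id, expressions, frames)


# Rough scale check from the slide's totals (41 hours is a rounded figure).
TOTAL_CLIPS = 12_038
TOTAL_HOURS = 41
mean_clip_sec = TOTAL_HOURS * 3600 / TOTAL_CLIPS         # about 12.3 seconds
mean_frames_per_clip = mean_clip_sec * ANNOTATION_FPS    # about 24.5 frames
```

Representing the out-of-frame case as `None` rather than an empty box keeps "no referred object" distinct from a zero-sized detection. The scale check suggests an average clip of roughly 12 seconds, or about 25 annotated frames.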
[Figure: MDETR detection examples. Referring expressions: "strainer inside the kitchen sink"; "The brown box with red writing, sitting on top of a blue box on the table". MDETR scores shown on the frames: 0.110, 0.908, 0.998, 0.991. Note on the figure: the referred object is difficult to detect.]