and Editing: Generate or edit images based on text or image content
2 Recognition and Description: Recognize objects in an image and output a description of the image
3 Localization: Recognize objects in an image and output their positions
4 OCR and Reasoning: Recognize text in an image and output that text
Reference: From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness, and Causality through Four Modalities
Prompt: "a penguin with colorful background" (Image Creator from Microsoft Designer)
Prompt: "kite-surfer in the ocean at sunset" (Structure and Content-Guided Video Synthesis with Diffusion Models)
Frame extraction / Description generation
Step 1 / Step 2 / Step 3
Image description: The image shows a curving road veering to the right with a white guardrail on the side. ..
Image description: The image shows a curving road veering to the right with a white guardrail on the side. ..
Metadata: Average Speed of this car: slow
Metadata: Does this car turn left in this movie?: Car turns left.
Image descriptions + metadata
In this video thumbnail image taken by a car's drive recorder, we see a sunny day with a road ahead. .. Curving road
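The flow on this slide (frame extraction, description generation, then combining image descriptions with metadata) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the pipeline behind the slide: the function names are hypothetical, `stub_describe_frame` stands in for a call to a vision-language model, and treating the third step as a plain text merge is a guess based on the "+" in the diagram.

```python
from typing import Callable, Dict, List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Step 1 (frame extraction): pick evenly spaced frame indices."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def describe_frames(indices: List[int],
                    describe_frame: Callable[[int], str]) -> List[str]:
    """Step 2 (description generation): one description per sampled frame."""
    return [describe_frame(i) for i in indices]


def build_prompt(descriptions: List[str],
                 metadata: Dict[str, str],
                 question: str) -> str:
    """Assumed step 3: merge descriptions and metadata into one text prompt."""
    lines = ["Frame descriptions:"]
    lines += [f"{n}. {d}" for n, d in enumerate(descriptions, start=1)]
    lines.append("Metadata:")
    lines += [f"- {key}: {value}" for key, value in metadata.items()]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


def stub_describe_frame(index: int) -> str:
    # Hypothetical placeholder; a real system would query a captioning model.
    return f"(description of frame {index})"


if __name__ == "__main__":
    indices = sample_frame_indices(total_frames=300, num_samples=3)
    descriptions = describe_frames(indices, stub_describe_frame)
    metadata = {"Average Speed of this car": "slow"}
    print(build_prompt(descriptions, metadata,
                       "Does this car turn left in this movie?"))
```

Keeping each step as a separate function means the stub can be swapped for a real captioning model without touching frame sampling or prompt assembly.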
dog in the image using bounding box.
Question: How many dogs are in the image?
Answer: There are eleven dogs in the image.
Source: From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness, and Causality through Four Modalities
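To make the localization and counting examples concrete, the sketch below shows one way bounding-box output could be represented and counted once it has been parsed. The `Detection` structure, its field names, and the pixel-coordinate convention are assumptions made for illustration; the source does not specify the format in which the models return boxes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Detection:
    """Hypothetical parsed detection: a label plus a box in pixel coordinates."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def is_valid(self, width: float, height: float) -> bool:
        # A box must have positive area and lie inside the image.
        return (0 <= self.x_min < self.x_max <= width
                and 0 <= self.y_min < self.y_max <= height)


def count_label(detections: List[Detection], label: str,
                width: float, height: float) -> int:
    """Count well-formed boxes carrying the given label."""
    return sum(1 for d in detections
               if d.label == label and d.is_valid(width, height))


if __name__ == "__main__":
    boxes = [
        Detection("dog", 10, 10, 50, 60),
        Detection("dog", 70, 20, 120, 90),
        Detection("cat", 0, 0, 30, 30),
        Detection("dog", 90, 90, 80, 100),  # malformed: x_min > x_max
    ]
    print(count_label(boxes, "dog", width=200, height=200))
```

Counting validated boxes gives a simple cross-check against a model's free-text answer to a question such as "How many dogs are in the image?".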
is present in the picture?
Answer: ChEF decouples the evaluation pipeline into four components:• Scenario: A set of datasets concerning representative multimodal tasks that are suitable for MLLMs…
Question: Choose the appropriate shape to replace the shape that is missing.
Answer: The solution to the puzzle is to place the number 3 in the spot marked with a question mark. This maintains a consistent pattern of differences in both the rows and the columns of the grid.
Source: From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness, and Causality through Four Modalities
Generation and Editing: Generate or edit videos based on text content
2 Video Search: Search for specific video content
3 Video Description and Summarization: Create descriptions or storylines for videos
4 Video Classification: Automatically classify videos into predefined classes or topics
5 Video Question Answering: Answer questions about a video based on visual and linguistic information
Reference: From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness, and Causality through Four Modalities
Models
⚫ From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness, and Causality through Four Modalities
⚫ Identifying Geographical Location of the Image
⚫ 生成AI時代におけるUXデザイン | 生成AIをフル活用したUX設計手法&生成AI時代のユーザー体験の変化について [UX Design in the Age of Generative AI | UX design methods that make full use of generative AI, and how user experience is changing in the generative AI era]
⚫ screenshot-to-code
⚫ A Survey on Hallucination in Large Vision-Language Models
⚫ Red-Teaming the Stable Diffusion Safety Filter
⚫ DALL-E 2などの画像生成AIに対する敵対的攻撃 [Adversarial attacks on image generation AI such as DALL-E 2]
⚫ Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs
⚫ Evaluating Object Hallucination in Large Vision-Language Models
⚫ MM-Vid: Advancing Video Understanding with GPT-4V(ision)
⚫ A Tour of Video Understanding Use Cases
⚫ 伊藤園、生成AIでCMモデル 「お~いお茶」SNSで拡散 [Ito En uses a generative AI model in a commercial; "Oi Ocha" spreads on social media]
⚫ Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
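Understanding System
⚫ Diagrams: Show Me
⚫ グーグルのAI「Bard」が劇的進化、YouTube動画の要約や質問が可能に [Google's AI "Bard" evolves dramatically; summarizing and asking questions about YouTube videos is now possible]
⚫ GoogleのチャットAI「Bard」でYouTube動画の内容を要約させることが可能に、コンテンツ作成者に悪影響が及ぶ懸念も [Google's chat AI "Bard" can now summarize YouTube video content, raising concerns about negative effects on content creators]
⚫ Deep Learning-Based Anomaly Detection in Video Surveillance: A Survey
⚫ Top 18 Applications of Computer Vision in Security and Surveillance
⚫ UCF Sports Action Data Set
⚫ MM-LLMs: Recent Advances in MultiModal Large Language Models
⚫ Adversarial Attacks on Image Generation With Made-Up Words
⚫ Hallucination Leaderboard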