The Frontier of Multimodal Generative AI: Applications and Risks to Consider

Multimodal Generative AI: Applications and Risks to Consider
Thursday, February 15, 2024
NTT DATA Group, Yusuke Nakajima (中島佑允)

Table of Contents
1. Introduction
2. Self-Introduction
3. Overview of Multimodal Generative AI
4. Representative Models
5. Text × Image
6. Text × Video
7. Risks of Multimodal Generative AI
8. Summary

1. Introduction


1. Introduction
Reference: Identifying Geographical Location of the Image

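1. Introduction
⚫ Generative AI, led by ChatGPT and Bard (Gemini), has recently attracted a great deal of attention across many industries
⚫ However, "multimodal generative AI", which can handle images and other information in addition to language, is not yet widely adopted
⚫ This seminar covers the frontier of multimodal generative AI, its applications, and the risks to consider when using it
※ The content and statements in this seminar are Nakajima's personal views.
※ The slides will be published on X (formerly Twitter) after the seminar.
※ The technologies and services introduced today may become unavailable because of ethical or other issues.
※ Due to time constraints, we cannot go into fine technical detail. Thank you for your understanding.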

2. Self-Introduction

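2. Self-Introduction
⚫ Name: Yusuke Nakajima (中島佑允)
⚫ Affiliations:
  • NTT DATA Group (main job)
    ✓ Worked on new business development, machine learning projects, and similar work (until the end of March 2023)
    ✓ Cybersecurity Technology Department (since April 2023)
  • JDLA (Japan Deep Learning Association), in charge of human resource development
  • AI-SCHOLAR writer
⚫ Hobbies: tennis, karaoke, etc.
⚫ X (formerly Twitter) account: @nakajimeeee (four e's)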

3. Overview of Multimodal Generative AI

3. Overview of Multimodal Generative AI
[Figure labels: "multimodal generative AI", "modal"]

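3. Overview of Multimodal Generative AI
⚫ This seminar covers generative AI that can process images and video in addition to text (Vision-Language Models, VLMs)
⚫ Example
  • Generating an image from text (Text-to-Image)
  • Representative models: DALL·E, Stable Diffusion
[Figure label: Vision-Language Models (VLMs)]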

3. Overview of Multimodal Generative AI
Source: MM-LLMs: Recent Advances in MultiModal Large Language Models

4. Representative Models

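4. Representative Models
⚫ The two representative multimodal generative AI models are as follows
  • GPT-4V
    ✓ A multimodal model announced by OpenAI in March 2023
    ✓ An excellent model not only in task performance but also in the reliability and consistency of its output
    ✓ Open-source models are often evaluated with GPT-4 as the reference point
  • Gemini
    ✓ A multimodal model announced by Google in December 2023
    ✓ Some reports say it substantially outperforms GPT-4V on multimodal tasks
    ✓ Can be integrated with platforms such as YouTube and Google Drive
References:
  • MM-LLMs: Recent Advances in MultiModal Large Language Models
  • Hallucination Leaderboard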

5. Text × Image

5. Text × Image
⚫ The applications introduced are as follows

No.  Application                  Description
1    Generation and Editing       Generates or edits images based on text/image content
2    Recognition and Description  Recognizes objects in an image and outputs a description of the image
3    Localization                 Recognizes objects in an image and outputs their positions
4    OCR and Reasoning            Recognizes text in an image and outputs that text

Reference: From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness, and Causality through Four Modalities


5. Text × Image
⚫ Generation and Editing
  • "3D render of a penguin with colorful background" (Image Creator from Microsoft Designer)
  • "kite-surfer in the ocean at sunset" (Structure and Content-Guided Video Synthesis with Diffusion Models)

5. Text × Image
⚫ Generation and Editing
  • Creating sequence diagrams
    ✓ Use GPTs to create a sequence diagram that visually represents the flow of system processing for a given action
Source: Diagrams: Show Me

5. Text × Image
⚫ Generation and Editing
  • Creating storyboards
    ✓ A storyboard visualizes the ideal flow of a user's experience as an illustrated story and is very important in UX design; generative AI makes creating one more efficient
Source: 生成AI時代におけるUXデザイン | 生成AIをフル活用したUX設計手法＆生成AI時代のユーザー体験の変化について


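5. Text × Image
⚫ Generation and Editing
  • Creating web pages (screenshot-to-code)
    ✓ Converts a screenshot of a web page you want to use as a reference into HTML/CSS. The generated page can then be edited with natural-language instructions
Source: screenshot-to-code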

5. Text × Image
⚫ Recognition and Description

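5. Text × Image
⚫ Recognition and Description
  • Building a video search system (case study from Turing Inc.)
  • Problem
    ✓ Until now, building a model required a person to watch videos and cut out specific segments
    ✓ This work is time-consuming, and it makes diverse data hard to obtain
  • Approach
    1. Extract frames from the video at fixed intervals and generate a description for each frame
    2. Combine each description with its associated metadata (speed, steering, etc.) and use GPT-3.5 to produce natural-language text
    3. Run a query against that text and extract only the similar frames, which enables efficient video search
Reference: Bardのようなimage2textAIを構築して動画検索システムを作る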

5. Text × Image
⚫ Recognition and Description
  • Building a video search system (case study from Turing Inc.)
    Step 1: Frame extraction, then description generation
      "The image shows a curving road veering to the right with a white guardrail on the side. .."
    Step 2: Image description + metadata
      Image description: "The image shows a curving road veering to the right with a white guardrail on the side. .."
      Metadata: "Average Speed of this car: slow", "Does this car turn left in this movie?: Car turns left."
      Resulting text: "In this video thumbnail image taken by a car's drive recorder, we see a sunny day with a road ahead. .."
    Step 3: Query
      "Curving road"
Reference: Bardのようなimage2textAIを構築して動画検索システムを作る
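
The three-step approach above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Turing's implementation: the per-frame texts are hand-written stand-ins for what a captioning model and GPT-3.5 would produce, and a simple bag-of-words cosine similarity is swapped in for whatever text-similarity method the original system uses. The function names (`sample_frame_indices`, `search`) and all sample data are hypothetical.

```python
import math
import re
from collections import Counter


def sample_frame_indices(total_frames: int, fps: float, interval_sec: float) -> list[int]:
    """Step 1 (partly): indices of frames taken every `interval_sec` seconds."""
    step = max(1, round(fps * interval_sec))
    return list(range(0, total_frames, step))


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words representation of a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, frame_texts: dict[int, str], top_k: int = 1) -> list[tuple[int, float]]:
    """Step 3: rank frames by similarity between the query and each frame's text."""
    q = tokenize(query)
    scored = [(idx, cosine(q, tokenize(text))) for idx, text in frame_texts.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical 30-second clip at 30 fps, sampled every 10 seconds.
indices = sample_frame_indices(total_frames=900, fps=30.0, interval_sec=10.0)

# Step 2 stand-in: description plus metadata, already merged into one text per frame.
frame_texts = {
    0: "A sunny day with a straight road ahead. Average speed: fast. The car goes straight.",
    300: "A curving road veering to the right with a white guardrail. Average speed: slow. The car turns left.",
    600: "A car stopped at a red traffic light in the city. Average speed: stopped.",
}

results = search("curving road with a guardrail", frame_texts)
print(indices)
print(results)
```

In a real system, the text for each sampled frame would come from the captioning and GPT-3.5 steps, and an embedding model would likely replace the bag-of-words vectors.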


5. Text × Image
⚫ Localization
  Question: Localize each person and dog in the image using bounding box.
  Question: How many dogs are in the image?
  Answer: There are eleven dogs in the image
Source: From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness, and Causality through Four Modalities


5. Text × Image
⚫ OCR and Reasoning
  Question: What text is present in the picture?
  Answer: ChEF decouples the evaluation pipeline into four components: • Scenario: A set of datasets concerning representative multimodal tasks that are suitable for MLLMs…
  Question: Choose the appropriate shape to replace the shape that is missing.
  Answer: the solution to the puzzle is to place the number 3 in the spot marked with a question mark. This maintains a consistent pattern of differences in both the rows and the columns of the grid
Source: From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness, and Causality through Four Modalities

5. Text × Image
⚫ OCR and Reasoning
  • Converting flowcharts to code
    ✓ Reads a flowchart as an image and converts it into code in a programming language such as Python or Java
Source: 若手プログラマー保存版！フローチャート徹底解説と作成カンニングペーパー

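5. Text × Image
⚫ OCR and Reasoning
  • Detecting website tampering
    ✓ Roughly 200 to 300 cases per month are reported in which a company's website is tampered with through unauthorized access
    ✓ If the tampering goes unnoticed, customers may be led to a malicious website
    ✓ VLMs are used to detect tampering by comparing the current page against the legitimate web page
Reference: JPCERT/CC インシデント報告対応レポート 2022年1月1日～2022年3月31日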
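
One way to picture the comparison step is the sketch below. It assumes that the visible text of the legitimate page and of the current page has already been extracted from screenshots (for example by a VLM; that step is not shown), and it compares the two with Python's standard `difflib`. The function name `detect_tampering`, the 0.9 threshold, and the sample page text are all hypothetical choices for illustration, not part of the original talk.

```python
import difflib


def detect_tampering(baseline_text: str, current_text: str, threshold: float = 0.9):
    """Compare page text line by line and flag the page if similarity is too low.

    Returns (flagged, similarity_ratio, lines_not_in_baseline).
    """
    base_lines = baseline_text.splitlines()
    cur_lines = current_text.splitlines()
    # ratio() is 2 * matches / total elements, so 1.0 means identical line sequences.
    ratio = difflib.SequenceMatcher(None, base_lines, cur_lines).ratio()
    added = [line for line in cur_lines if line not in base_lines]
    return ratio < threshold, ratio, added


# Hypothetical text extracted from the legitimate page and from the live page.
baseline = "Example Corp\nProducts\nContact us\nCopyright 2024"
current = "Example Corp\nProducts\nYour account is locked. Enter your password here\nCopyright 2024"

flagged, ratio, added = detect_tampering(baseline, current)
print(flagged, ratio, added)
```

A text-only comparison like this misses purely visual changes, which is one reason to use a model that looks at the rendered page itself.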

6. Text × Video

6. Text × Video
⚫ The applications introduced are as follows

No.  Application                          Description
1    Video Generation and Editing         Generates or edits video based on text content
2    Video Search                         Searches for specific video content
3    Video Description and Summarization  Produces a description or storyline of a video
4    Video Classification                 Automatically classifies videos into predefined classes or topics
5    Video Question Answering             Answers questions about a video based on visual and language information

Reference: From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness, and Causality through Four Modalities

6. Text × Video
⚫ Video Generation and Editing
  • "Men and women walking in spectacular nature"
Source: Runway

6. Text × Video
⚫ Video Generation and Editing
  • Creating a TV commercial (case study from ITO EN)
    ✓ A model created with generative AI was cast in a TV commercial
Source: お～いお茶カテキン緑茶TVーCM 「未来を変えるのは、今！」篇

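6. Text × Video
⚫ Video Search
  • Searching YouTube videos with Google Gemini (formerly Bard)
    ✓ Gemini can be integrated with services provided by Google
    ✓ YouTube hosts many useful videos, and this enables advanced search that takes into account a video's content as well as its title
Source: Gemini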

6. Text × Video
⚫ Video Description and Summarization
Source: Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

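6. Text × Video
⚫ Video Description and Summarization
  • Explaining and summarizing YouTube videos with Google Gemini (formerly Bard)
    ✓ Gemini can be integrated with services provided by Google
    ✓ YouTube hosts many useful videos, and generating explanations and summaries makes information gathering more efficient
Source: GoogleのチャットAI「Bard」でYouTube動画の内容を要約させることが可能に、コンテンツ作成者に悪影響が及ぶ懸念も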

6. Text × Video
⚫ Video Classification
Source: UCF Sports Action Data Set

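6. Text × Video
⚫ Video Classification
  • Security systems for public facilities and stores
    ✓ Constantly monitoring security camera footage is very labor-intensive
    ✓ VLMs classify the video and detect suspicious events and behavior, which makes this work more efficient
Source: Top 18 Applications of Computer Vision in Security and Surveillance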

6. Text × Video
⚫ Video Question Answering
Source: ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System

7. Risks of Multimodal Generative AI

7. Risks of Multimodal Generative AI
⚫ Hallucination
  • Large language models (LLMs) sometimes produce content that is not true (hallucination)
  • Vision-Language Models are known to hallucinate in the same way
Source: Evaluating Object Hallucination in Large Vision-Language Models

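7. Risks of Multimodal Generative AI
⚫ Generation of harmful images
  • Image generation AI has built-in safety mechanisms that prevent it from generating harmful images
  • For example, the filter computes how similar a generated image is to each banned concept and outputs an error when the similarity exceeds a threshold
Source: Red-Teaming the Stable Diffusion Safety Filter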
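
The thresholding idea can be sketched as below. This is a toy illustration, not the actual Stable Diffusion filter: the 3-dimensional vectors stand in for image and concept embeddings that a real system would obtain from an embedding model, and the names `check_image` and `UnsafeImageError`, the concept labels, and the 0.8 threshold are hypothetical.

```python
import math


class UnsafeImageError(Exception):
    """Raised when an image is too similar to a banned concept."""


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def check_image(image_embedding, banned_concepts, threshold=0.8):
    """Return True if the image passes; raise if any similarity exceeds the threshold."""
    for name, concept_embedding in banned_concepts.items():
        similarity = cosine_similarity(image_embedding, concept_embedding)
        if similarity > threshold:
            raise UnsafeImageError(f"blocked: too close to '{name}' ({similarity:.2f})")
    return True


# Hypothetical toy embeddings for two banned concepts and two generated images.
banned_concepts = {"concept_a": [1.0, 0.0, 0.0], "concept_b": [0.0, 1.0, 0.0]}
safe_image = [0.1, 0.2, 0.97]
unsafe_image = [0.95, 0.1, 0.1]

passed = check_image(safe_image, banned_concepts)
try:
    check_image(unsafe_image, banned_concepts)
    blocked = False
except UnsafeImageError as err:
    blocked = True
    print(err)
```

Because the decision rests on a single similarity score per concept, anything that shifts the image embedding away from the banned concepts can slip under the threshold, which is the weakness that red-teaming work targets.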

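7. Risks of Multimodal Generative AI
⚫ Generation of harmful images
  • A Prompt Dilution attack inserts several words unrelated to the target image to dilute the "degree of harmfulness", in an attempt to bypass the safety mechanism
Source: Red-Teaming the Stable Diffusion Safety Filter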

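7. Risks of Multimodal Generative AI
⚫ Generation of harmful images
  • A Macaronic Prompting attack creatively combines words from different languages to coin made-up words that humans cannot understand but that still cause DALL-E 2 to generate the intended image
[Figure labels: target image, German, Italian, French, Spanish, coined word]
Sources:
  • DALL-E 2などの画像生成AIに対する敵対的攻撃
  • Adversarial Attacks on Image Generation With Made-Up Words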

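7. Risks of Multimodal Generative AI
⚫ Harmful responses triggered by instructions embedded in images
  • Generative AI is designed not to output harmful content (malware, content suggestive of racism, etc.)
  • However, filtering images is very difficult, and when malicious instructions are embedded in an image, the model may produce a response that follows them
Source: Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs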

8. Summary

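8. Summary
⚫ Multimodal generative AI is an actively researched topic, and many more models and services are expected to appear
⚫ The representative models are GPT-4V and Gemini, which outperform other models on a variety of tasks
⚫ A variety of applications that handle text together with images or video have emerged and are starting to be built into services
⚫ Multimodal generative AI also carries many risks, such as hallucination, which must be kept in mind when using it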

Appendix

References
⚫ Structure and Content-Guided Video Synthesis with Diffusion Models
⚫ From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness, and Causality through Four Modalities
⚫ Identifying Geographical Location of the Image
⚫ 生成AI時代におけるUXデザイン | 生成AIをフル活用したUX設計手法＆生成AI時代のユーザー体験の変化について
⚫ screenshot-to-code
⚫ A Survey on Hallucination in Large Vision-Language Models
⚫ Red-Teaming the Stable Diffusion Safety Filter
⚫ DALL-E 2などの画像生成AIに対する敵対的攻撃
⚫ Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs
⚫ Evaluating Object Hallucination in Large Vision-Language Models
⚫ MM-Vid: Advancing Video Understanding with GPT-4V(ision)
⚫ A Tour of Video Understanding Use Cases
⚫ 伊藤園、生成AIでCMモデル「お～いお茶」SNSで拡散
⚫ Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
⚫ ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System
⚫ Diagrams: Show Me
⚫ グーグルのAI「Bard」が劇的進化、YouTube動画の要約や質問が可能に
⚫ GoogleのチャットAI「Bard」でYouTube動画の内容を要約させることが可能に、コンテンツ作成者に悪影響が及ぶ懸念も
⚫ Deep Learning-Based Anomaly Detection in Video Surveillance: A Survey
⚫ Top 18 Applications of Computer Vision in Security and Surveillance
⚫ UCF Sports Action Data Set
⚫ MM-LLMs: Recent Advances in MultiModal Large Language Models
⚫ Adversarial Attacks on Image Generation With Made-Up Words
⚫ Hallucination Leaderboard
