About Embodied AI

© NTT Communications Corporation All Rights Reserved.
Media AI Project Study Session
About Embodied AI
NTT Communications, Innovation Center, Technology Division, 鈴ヶ嶺聡哲

Agenda
1. Embodied cognition
2. Embodied AI
3. Challenges
4. Tasks
  • Navigation
  • Rearrangement
  • Vision-and-Language
5. Simulators
  • Habitat
  • Isaac Sim
  • TDW
  • AI2-THOR
  • SAPIEN
6. Common approaches
  • Large-Scale Training
  • Visual Pre-Training
  • End-to-end, Modular
  • Visual and Dynamic Augmentation
7. Summary

Embodied cognition
• Embodied cognition: the theory that intelligence is formed through the interaction between an agent and its environment, mediated by the agent's sensory systems
• Six lessons from developmental psychology for raising intelligent agents
  1. Be Multimodal
    • Perceive and act using multiple sensory systems
  2. Be Incremental
    • An agent is not smart from the start
  3. Be Physical
    • Learn from experience and from interaction with a world governed by regular physical laws
  4. Explore
    • Discover new problems and new solutions
  5. Be Social
    • Learn from more mature agents
  6. Learn a Language
    • Acquire higher-level, abstract cognition
Figure: the four steps of language learning
Figure: the interaction between an apple and multiple sensory systems (vision, hearing)
Linda Smith and Michael Gasser. 2005. The Development of Embodied Cognition: Six Lessons from Babies. Artif. Life 11, 1–2 (January 2005), 13–30. https://doi.org/10.1162/1064546053278973

Embodied AI
• Embodied AI refers to intelligent agents built on the idea of embodied cognition
• Concretely, the goal is to build robots that learn autonomously and are equipped with multiple sensory systems (vision, touch, hearing, etc.)
• The community runs an ongoing workshop at CVPR (Computer Vision and Pattern Recognition), a top computer vision conference
• Three themes for 2023
  1. Foundation Models
    • Adapting large-scale pre-trained models to new tasks with little additional training
  2. Generalist Agents
    • Extending a model trained on one task with a single learning method to new tasks
  3. Sim to Real Transfer
    • Techniques for deploying models trained in simulation in the real world
https://embodied-ai.org

Challenges
• The challenges are built from multiple simulators and tasks
• Tasks fall into three main categories
  • Navigation
  • Rearrangement
  • Vision-and-Language
List of challenges: https://embodied-ai.org/#challenges

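Navigation 1/3
• PointNav
  • The agent navigates to a goal specified as a position relative to its starting position, without being given a map of the environment in advance
  • The basic agent inputs are RGB images, depth images, and an egomotion sensor
• Interactive and Social PointNav
  • The agent must reach the goal in environments that contain movable objects such as furniture and dynamic agents such as pedestrians
  • Excellent results have been achieved in static environments such as warehouses
  • Dynamic environments such as homes and offices remain difficult
Figure: in PointNav, no map of the environment is given in advance
Figure: dynamic environments, where the agent must (a) push small obstacles aside and (b) respect the personal space of pedestrians
Deitke, Matt, et al. "Retrospectives on the embodied ai workshop." arXiv preprint arXiv:2210.06849 (2022).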
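The PointNav goal specification above can be illustrated with a small geometric sketch. It assumes a 2D world frame anchored at the start pose and a pose estimate accumulated from the egomotion sensor; the function name and the angle conventions are illustrative assumptions, not part of any challenge API.

```python
import math

def goal_in_agent_frame(agent_xy, agent_heading, goal_xy):
    """Return (distance, bearing) of the goal as seen from the agent.

    agent_xy and goal_xy are (x, y) positions in the start frame.
    agent_heading is the agent's yaw in radians, counter-clockwise from
    the x-axis. The bearing is wrapped to [-pi, pi); a positive value
    means the goal lies to the agent's left.
    """
    dx = goal_xy[0] - agent_xy[0]
    dy = goal_xy[1] - agent_xy[1]
    distance = math.hypot(dx, dy)
    bearing = math.atan2(dy, dx) - agent_heading
    # Wrap the angle so that small left/right turns stay near zero.
    bearing = (bearing + math.pi) % (2 * math.pi) - math.pi
    return distance, bearing

# The agent has moved to (1, 1) and still faces along the x-axis,
# while the goal was given as (1, -1) relative to the start.
distance, bearing = goal_in_agent_frame((1.0, 1.0), 0.0, (1.0, -1.0))
```

In this usage the goal is two units away and directly to the agent's right, so the agent would turn right before moving forward.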

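Navigation 2/3
• ObjectNav
  • The agent navigates to a specified object, for example a bed
  • An episode counts as a success if the agent issues the Done action while the object is visible in its camera frame
• Multi-ObjectNav
  • The agent searches for multiple objects in a prescribed order
Figure: example of navigating to a bed
Figure: example of searching for multiple objects in order
Deitke, Matt, et al. "Retrospectives on the embodied ai workshop." arXiv preprint arXiv:2210.06849 (2022).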
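The ObjectNav success condition described above can be sketched as a simple check over an episode log. The log format (a list of action and visible-category pairs) and every action name other than Done are hypothetical; actual benchmarks may impose additional conditions that this sketch does not cover.

```python
def episode_succeeded(steps, target_category):
    """Check the ObjectNav success condition on a logged episode.

    steps is a list of (action, visible_categories) pairs, where
    visible_categories is the set of object categories visible in the
    camera frame at the moment the action is issued (hypothetical
    log format). The episode ends at the first "Done" action.
    """
    for action, visible_categories in steps:
        if action == "Done":
            return target_category in visible_categories
    # The agent never issued Done, so the episode cannot succeed.
    return False

episode = [
    ("MoveAhead", {"chair"}),
    ("RotateLeft", {"chair", "bed"}),
    ("Done", {"bed"}),
]
result = episode_succeeded(episode, "bed")
```

Note that merely seeing the bed is not enough: the agent has to decide on its own that it has arrived and issue Done.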

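Navigation 3/3
• Navigating to Identify All Objects in a Scene
  • In the SLAM task of Robotic Vision Scene Understanding (RVSU), the agent explores the environment and assigns a semantic label to every object
  • The question is which objects are located where
  • It is generally treated as semantic SLAM
• Audio-Visual Navigation
  • A multimodal navigation task that adds audio to the conventional image input
  • The agent needs to estimate positions in the environment through sound
  • Concretely, a sound source such as a telephone is placed at a random location in the environment, and the agent searches for it using audio information
Figure: the RVSU SLAM task
Figure: searching for an object using image and audio information
Deitke, Matt, et al. "Retrospectives on the embodied ai workshop." arXiv preprint arXiv:2210.06849 (2022).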
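The "which objects are where" output of the semantic SLAM task can be pictured as an object map that accumulates labeled detections. The sketch below is only a data-structure illustration under assumed names (add_detection, merge_radius); it is not the RVSU evaluation format and performs no localization of its own.

```python
import math

def add_detection(object_map, label, position, merge_radius=0.5):
    """Merge one labeled detection into a list-based object map.

    object_map is a list of dicts with keys "label", "position" and
    "count". A detection within merge_radius of an existing entry with
    the same label is treated as the same object, and the stored
    position becomes the running mean of all its detections.
    """
    for entry in object_map:
        same_label = entry["label"] == label
        if same_label and math.dist(entry["position"], position) <= merge_radius:
            n = entry["count"]
            entry["position"] = tuple(
                (old * n + new) / (n + 1)
                for old, new in zip(entry["position"], position)
            )
            entry["count"] = n + 1
            return entry
    entry = {"label": label, "position": tuple(position), "count": 1}
    object_map.append(entry)
    return entry

object_map = []
add_detection(object_map, "chair", (0.0, 0.0, 0.0))
add_detection(object_map, "chair", (0.2, 0.0, 0.0))  # same chair seen again
add_detection(object_map, "chair", (3.0, 0.0, 0.0))  # a second chair
```

After these three detections the map holds two chairs, the first one positioned at the mean of its two observations.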

Rearrangement
• Scene Change Detection
  • The task of identifying the objects that have changed (been added or removed) between two images of the same scene
• Interactive Rearrangement
  • In AI2-THOR Visual Room Rearrangement, the agent first explores the environment in a walkthrough phase, then in an unshuffle phase returns the objects that were moved to different locations to their initial state
Figure: example of scene change detection
Figure: AI2-THOR Visual Room Rearrangement, example of returning objects to their initial state
Deitke, Matt, et al. "Retrospectives on the embodied ai workshop." arXiv preprint arXiv:2210.06849 (2022).
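The two-phase structure of the rearrangement task can be sketched as a comparison between the object positions recorded during the walkthrough and those observed before unshuffling. This sketch assumes the agent already has object positions keyed by a hypothetical object name; in the actual task these have to be inferred from observations, and the function name and tolerance are illustrative.

```python
import math

def objects_to_restore(goal_positions, current_positions, tolerance=0.05):
    """Return {name: goal_position} for every object that has moved.

    goal_positions holds the positions recorded in the walkthrough
    phase, current_positions the positions seen in the unshuffle
    phase. Both map an object name to an (x, y, z) tuple and are
    assumed to contain the same objects.
    """
    moved = {}
    for name, goal in goal_positions.items():
        if math.dist(goal, current_positions[name]) > tolerance:
            moved[name] = goal
    return moved

walkthrough = {"mug": (1.0, 0.0, 1.0), "book": (0.0, 0.5, 2.0)}
shuffled = {"mug": (1.0, 0.0, 1.0), "book": (2.0, 0.5, 0.0)}
todo = objects_to_restore(walkthrough, shuffled)
```

Here only the book has moved, so the result tells the agent to carry it back to its walkthrough position.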

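Vision-and-Language
• Navigation Instruction Following
  • A navigation task in which the agent is guided by a natural-language description of a long action sequence
  • In the upper-right figure, the agent is guided from the bedroom, through the hallway, to a spot between the kitchen island and the chairs
• Interactive Instruction Following
  • A task that maps natural language to actions in an interactive environment
  • In the lower-right figure, the agent acts on language instructions that correspond to an expert's demonstration of a household task
Figure: example of following a route described in natural language
Figure: the task involves object interaction, state-change tracking, and references to earlier steps
Deitke, Matt, et al. "Retrospectives on the embodied ai workshop." arXiv preprint arXiv:2210.06849 (2022).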

Challenges
• The challenges are built from multiple simulators and tasks
• Tasks fall into three main categories
  • Navigation
  • Rearrangement
  • Vision-and-Language
• Simulators
  • Habitat
  • Isaac Sim
  • TDW
  • AI2-THOR
  • SAPIEN
Figure: list of challenges

Habitat
• Habitat 3.0
  • A simulator developed mainly by Meta AI
  • Supports robots and humanoid avatars in interactive environments
  • Humanoid avatar features
    • Gender, body shape, and appearance can be varied
    • Can be controlled through a Humans-In-The-Loop interface (VR headset, keyboard, mouse)
      • A human takes part inside the system
• How to use
  • Habitat-Lab https://github.com/facebookresearch/habitat-lab
Figure: Navigation task
Figure: Rearrangement task
Figure: Humans-In-The-Loop with a VR headset

Isaac Sim
• Isaac Sim
  • A robotics simulator developed by NVIDIA
  • Built on Omniverse, a platform that supports building digital twins
  • Supports ROS, which is used in robot development
  • Supports OmniGraph (visual programming)
• How to use
  • NVIDIA Omniverse https://www.nvidia.com/en-us/omniverse/
Figure: overview of Isaac Sim
Figure: people simulation
Figure: visual programming with OmniGraph

TDW
• ThreeDWorld (TDW)
  • Developed by MIT Brain and Cognitive Sciences (BCS)
  • Image rendering, audio synthesis, and physics simulation run on the Unity3D engine, and the external interface is controlled from Python
• How to use
  • ThreeDWorld (TDW) https://github.com/threedworld-mit/tdw
Figure: example of a TDW simulation
Figure: comparison with other simulators

AI2-THOR
• iTHOR
  • Supports 120 rooms (kitchens, bedrooms, bathrooms, etc.)
  • More than 2,000 unique objects
  • Physics simulation is based on Unity3D
  • Supports multiple agents such as humanoids and drones
• RoboTHOR
  • Supports 89 apartments and more than 600 objects
  • Both simulated and real environments are available
• ManipulaTHOR
  • A simulator for manipulating objects with a robot arm
• How to use
  • ai2thor https://github.com/allenai/ai2thor
Figure: comparison with other frameworks
Figure: RoboTHOR
Figure: iTHOR

SAPIEN
• A SimulAted Part-based Interactive Environment (SAPIEN)
  • Joint research by UCSD, Stanford, and SFU
  • Supports a large-scale set of articulated objects
  • Can be operated through a pure Python interface
  • Supports multiple rendering outputs, including depth maps, normal maps, optical flow, active light, and ray tracing
• PartNet-Mobility Dataset
  • A dataset of 2,000 articulated objects
• How to use
  • https://sapien.ucsd.edu/downloads
Figure: simulation example
Figure: PartNet-Mobility Dataset

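Common approaches
• Large-Scale Training
  • As in computer vision and natural language processing, training on large-scale datasets yields high-performing results
  • A model pre-trained on 10,000 houses generated with ProcTHOR achieved state-of-the-art results on the Habitat 2022 ObjectNav Challenge, the RoboTHOR ObjectNav Challenge, and the AI2-THOR Rearrangement Challenge
• Visual Pre-Training
  • Replacing the visual backbone with a pre-trained CLIP-based ResNet-50 has been shown to improve performance substantially
  • The top models on the 1-Phase Rearrangement, RoboTHOR ObjectNav, and Habitat ObjectNav leaderboards use improved variants of the pre-trained EmbCLIP backbone
• End-to-end, Modular
  • End-to-end: learns to predict actions directly from the input
  • Modular: predicts actions using several modules, each of which is trained individually
  • The accompanying table shows the performance of the best method of each type
  • On tasks such as ObjectNav and Audio-Visual Navigation the two approaches perform comparably, whereas on more complex tasks such as Interactive Navigation and Rearrangement the modular approach performs considerably better
  • A likely cause is that the exploration complexity of end-to-end reinforcement learning grows exponentially
• Visual and Dynamic Augmentation
  • Augmenting real-world datasets is important for carrying robots over to unseen environments and to the real world
  • For example, its usefulness has been shown in a case where a model trained on LiDAR-based input was transferred to the real world
Deitke, Matt, et al. "Retrospectives on the embodied ai workshop." arXiv preprint arXiv:2210.06849 (2022).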
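The remark about exponentially growing exploration complexity can be made concrete with a rough counting argument. The sketch below is an illustration under strong assumptions that do not come from the cited paper: a fixed discrete action set, a fixed episode horizon, and a modular system whose sub-tasks split the horizon evenly and can be explored independently. The function names are hypothetical.

```python
def num_action_sequences(num_actions, horizon):
    """Number of distinct action sequences of a given length."""
    return num_actions ** horizon

def modular_search_space(num_actions, horizon, num_modules):
    """Total sequences to consider when the horizon is split evenly
    across modules that are explored independently of each other."""
    if horizon % num_modules != 0:
        raise ValueError("horizon must be divisible by num_modules")
    sub_horizon = horizon // num_modules
    return num_modules * num_action_sequences(num_actions, sub_horizon)

# 6 discrete actions and a 12-step task.
end_to_end = num_action_sequences(6, 12)     # 6**12 = 2176782336
modular = modular_search_space(6, 12, 3)     # 3 * 6**4 = 3888
```

Under these assumptions, every extra step multiplies the end-to-end search space by the number of actions, while splitting the task into independently learned modules keeps each exponent small. Real tasks rarely decompose this cleanly, so the numbers only indicate the trend.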

Summary
• Embodied AI builds on the idea of embodied cognition and aims to create robots that learn autonomously and are equipped with multiple sensors such as vision, touch, and hearing
• Three main tasks were introduced
  • Navigation
  • Rearrangement
  • Vision-and-Language
• A wide variety of simulators exist
  • Habitat
  • Isaac Sim
  • TDW
  • AI2-THOR
  • SAPIEN
• Common approaches were introduced
  • Large-Scale Training
  • Visual Pre-Training
  • End-to-end, Modular
  • Visual and Dynamic Augmentation

References
• Linda Smith and Michael Gasser. "The Development of Embodied Cognition: Six Lessons from Babies." Artif. Life 11, 1–2 (2005), 13–30. https://cogdev.sitehost.iu.edu/labwork/6_lessons.pdf
• Embodied AI Workshop, CVPR 2023. https://embodied-ai.org/
• Deitke, Matt, et al. "Retrospectives on the embodied ai workshop." https://arxiv.org/abs/2210.06849
• The Robotic Vision Challenges (RVSU). https://nikosuenderhauf.github.io/roboticvisionchallenges/scene-understanding.html
• AI2-THOR. https://ai2thor.allenai.org/
• Habitat 3.0. https://aihabitat.org/habitat3/
• Isaac Sim. https://developer.nvidia.com/ja-jp/isaac-sim
• ThreeDWorld (TDW). https://www.threedworld.org/
• A SimulAted Part-based Interactive Environment (SAPIEN). https://sapien.ucsd.edu/
• Deitke, Matt, et al. "ProcTHOR: Large-Scale Embodied AI Using Procedural Generation." https://arxiv.org/abs/2206.06994
• Khandelwal, Apoorv, et al. "Simple but Effective: CLIP Embeddings for Embodied AI." https://arxiv.org/abs/2111.09888
• Faust, Aleksandra, et al. "PRM-RL: Long-range Robotic Navigation Tasks by Combining Reinforcement Learning and Sampling-based Planning." https://arxiv.org/abs/1710.03937
