
Trends in Multimodal Models and Autonomous Driving

Yu Yamaguchi
January 14, 2025

Lecture material for the January 15, 2025 class.

Advances in generative AI, centered on large language models (LLMs) and diffusion models, have rapidly raised interest in multimodal technology. In particular, multimodal foundation models that jointly process modalities such as vision, language, and action are driving innovation in robotics and autonomous driving. This lecture surveys the technical trends surrounding multimodal foundation models and introduces recent applications, with a focus on autonomous driving.


Transcript

  1. About Me Yu Yamaguchi CTO / Director of AI, Turing

    Inc. • Former researcher at AIST and NIST, developing AI for Go and Shogi. • Joined Turing Inc. in 2022 as a founding member after serving as an executive officer at a public company. • Leads AI research for autonomous driving. 3
  2. Turing Inc.

     Overview • Founded: August 2021 • CEO: Issei Yamamoto • Total funding: $50MM • Employees: 50+
     Business • Development of fully autonomous vehicles, aiming to achieve this through generative AI.
  3. Contents • Multimodal Models ◦ Trends in recent large-scale models

    • Autonomous Driving Technology ◦ The DARPA Challenge and its legacy • Multimodal × Autonomous Driving ◦ Applications of multimodal AI centered around LLMs 5
  4. Contents • Multimodal Models ◦ Trends in recent large-scale models

    • Autonomous Driving Technology ◦ The DARPA Challenge and its legacy • Multimodal × Autonomous Driving ◦ Applications of multimodal AI centered around LLMs 6
  5. Multimodal Models

     • LLMs as the core of cognition ◦ Since CLIP [Radford et al., 2021], there has been significant progress in technologies that connect specific modalities with language models. ◦ Building on a pretrained LLM greatly reduces training costs. (Figure: representative multimodal models [Zhang et al., 2024])
  6. Mechanism of Multimodal Models

     Typical pipeline (recreated from [Zhang+ 2024] Fig. 2):
     • Inputs (text / image / video / audio) → modality encoders: NFNet-F6, ViT, CLIP ViT, Eva-CLIP ViT, C-Former, HuBERT, BEATs, ...
     • Input projector (adapter): linear projector, MLP, cross-attention, Q-Former, P-Former, MQ-Former, ... → multimodal understanding
     • LLM backbone: Flan-T5, UL2, Qwen, OPT, LLaMA, LLaMA-2, Vicuna, ...
     • Output projector: tiny Transformer, MLP, ...
     • Generators: Stable Diffusion, Zeroscope, AudioLDM, ... → image / video / audio outputs (multimodal generation)
  7. Vision-Language Models (VLMs) The mainstream approach connects a pretrained LLM

    with a vision encoder using an adapter. Wang, Jianfeng, et al. "Git: A generative image-to-text transformer for vision and language." arXiv preprint arXiv:2205.14100 (2022). 10
  8. How to Tokenize Image Features

     Two common designs (a minimal sketch of the first follows below):
     • Using a projector: image encoder → feature vectors → projector (MLP) → image tokens, fed to the Transformer alongside language tokens. e.g., GIT [Wang+], LLaVA [Liu+]
     • Using cross-attention: image encoder → adapter; the Transformer attends to image features via cross-attention and special tokens. e.g., BLIP-2 [Li+], Flamingo [Alayrac+]
     Alayrac, Jean-Baptiste, et al. "Flamingo: a visual language model for few-shot learning." NeurIPS 2022. Li, Junnan, et al. "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models." ICML 2023. Wang, Jianfeng, et al. "GIT: A generative image-to-text transformer for vision and language." arXiv preprint arXiv:2205.14100 (2022). Liu, Haotian, et al. "Visual instruction tuning." NeurIPS 2024.
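     As a rough illustration of the projector route, here is a minimal PyTorch sketch; the dimensions and layer count are assumptions for illustration, not the exact configuration of GIT or LLaVA. Features from a frozen vision encoder are mapped by an MLP into the LLM's embedding space and prepended to the text embeddings.

```python
# Minimal sketch of the "projector" approach (illustrative sizes, not GIT/LLaVA's exact config).
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Two-layer MLP mapping vision features into the LLM embedding space.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):        # (batch, num_patches, vision_dim)
        return self.proj(patch_features)      # (batch, num_patches, llm_dim) "image tokens"

# Usage: prepend the projected image tokens to the text token embeddings and
# feed the combined sequence to the (frozen or lightly tuned) LLM.
batch, num_patches, vision_dim, llm_dim = 2, 256, 1024, 4096
patch_features = torch.randn(batch, num_patches, vision_dim)   # from a frozen ViT/CLIP encoder
text_embeds = torch.randn(batch, 32, llm_dim)                  # from the LLM's embedding table
image_tokens = MLPProjector(vision_dim, llm_dim)(patch_features)
llm_inputs = torch.cat([image_tokens, text_embeds], dim=1)     # (batch, 256 + 32, llm_dim)
```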
  9. Flamingo [Alayrac+ 2022]

     A model that processes images, videos, and text simultaneously, enabling few-shot learning.
     • Image encoder + LLM ◦ Pretrained CLIP-style vision encoder and Chinchilla [Hoffmann+ 2022] ◦ Adds and trains gated cross-attention layers as the projector (see the sketch below). ◦ Efficiently converts images and videos into fixed-length tokens using a Perceiver [Jaegle+ 2021] resampler.
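     A simplified reading of the tanh-gated cross-attention idea, not the official implementation and with illustrative dimensions: text tokens attend to visual tokens, and the result is added through a learnable gate initialized at zero, so the frozen pretrained LM is undisturbed at the start of training.

```python
# Simplified tanh-gated cross-attention block in the spirit of Flamingo
# (not the official implementation; dimensions are illustrative).
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=4096, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # The gate starts at 0, so tanh(gate) = 0 and the block is initially an identity.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        # Text queries attend over visual keys/values.
        attended, _ = self.attn(self.norm(text_tokens), visual_tokens, visual_tokens)
        return text_tokens + torch.tanh(self.gate) * attended

# These trainable blocks are interleaved with the frozen LM layers;
# visual_tokens would come from the Perceiver resampler.
x = GatedCrossAttention()(torch.randn(1, 16, 4096), torch.randn(1, 64, 4096))
```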
  10. LLaVA [Liu+ 2023]

     Achieves high performance through high-quality instruction-tuning data for image-language tasks.
     • Instruction-tuning data ◦ A large amount of data was generated from the COCO dataset using GPT-4 (an illustrative sample format follows below).
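     For intuition, here is what one visual instruction-tuning sample can look like. The content below is made up and the field names only mirror the commonly used LLaVA-style conversation format; it is not taken from the actual dataset.

```python
# Illustrative shape of a visual instruction-tuning sample (made-up content;
# field names mirror the commonly used LLaVA-style conversation format).
sample = {
    "image": "coco/train2017/000000123456.jpg",   # hypothetical file name
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this scene?"},
        {"from": "gpt",   "value": "A man is ironing clothes on a board attached to a moving taxi."},
    ],
}
```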
  11. Heron [Tanahashi+ 2023]

     • Adds visual modules to pretrained LLMs in arbitrary combinations.
     • Trained a 73B-parameter vision-language model.
  12. Heron [Tanahashi+ 2023] (demo output)

     Sample caption generated in the demo (translated from Japanese): "In the image, a yellow taxi is stopped on the street, and a man wearing a yellow shirt is sitting on top of it. On the back of the taxi he is doing various tasks, such as ironing and folding laundry. The interesting point of this scene is the presence of the iron placed on top of the taxi." * Image used in the demo is from the GPT-4 technical report.
  13. How to Create “Image Tokens”

     • VQ-VAE [Van Den Oord+ 2017] • TiTok [Yu+ 2024] (a minimal sketch of the quantization step follows below)
     Van Den Oord, Aaron, and Oriol Vinyals. "Neural discrete representation learning." NeurIPS 2017. Yu, Qihang, et al. "An Image is Worth 32 Tokens for Reconstruction and Generation." NeurIPS 2024. https://openreview.net/forum?id=tOXoQPRzPL
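     A minimal sketch of the vector-quantization step at the heart of VQ-VAE, for illustration only: the straight-through gradient and the commitment/codebook losses are omitted, and the sizes are arbitrary. Each encoder output vector is replaced by the index of its nearest codebook entry, which is what yields discrete "image tokens".

```python
# Minimal vector-quantization step (the core of VQ-VAE); illustrative only:
# the straight-through estimator and commitment/codebook losses are omitted.
import torch

def quantize(z, codebook):
    """z: (batch, num_vectors, dim) encoder outputs; codebook: (K, dim)."""
    # L2 distance from every latent vector to every codebook entry.
    dists = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))
    indices = dists.argmin(dim=-1)          # discrete "image tokens", (batch, num_vectors)
    z_q = codebook[indices]                 # quantized latents, (batch, num_vectors, dim)
    return indices, z_q

codebook = torch.randn(1024, 64)            # K = 1024 codes of dimension 64
tokens, z_q = quantize(torch.randn(2, 256, 64), codebook)   # 2 images, 16x16 latent grid
```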
  14. Interleaved Text-and-Image Generation

     Consistently understands and generates data in which text and images are interleaved. Chameleon [Chameleon Team 2024] Team, Chameleon. "Chameleon: Mixed-modal early-fusion foundation models." arXiv preprint arXiv:2405.09818 (2024).
  15. Contents • Multimodal Models ◦ Trends in recent large-scale models

    • Autonomous Driving Technology ◦ The DARPA Challenge and its legacy • Multimodal × Autonomous Driving ◦ Applications of multimodal AI centered around LLMs 19
  16. Levels of Autonomous Driving

     • Level 0: No autonomous driving
     • Level 1: Accelerator/brake OR steering wheel
     • Level 2: Accelerator/brake AND steering wheel
     • Level 3: System drives in specific conditions (driver required)
     • Level 4: System drives in specific conditions
     • Level 5: Fully autonomous driving
     Notes: the lower levels are equipped in many commercial vehicles (e.g., cruise control); some commercial services exist at the higher levels; Level 5 has yet to be achieved.
  17. History of Autonomous Driving (2004~2024)

     • 2004: The DARPA Grand Challenge
     • 2007: CMU won the DARPA Urban Challenge
     • 2009: Google X self-driving car project
     • 2011: Nevada authorized self-driving cars on public roads
     • 2014: Tesla began developing Autopilot
     • 2015: SAE defined the autonomous driving levels
     • 2018: Waymo launched a commercial self-driving taxi service
     • 2020: Honda launched a Level 3 autonomous vehicle
     • 2021: Waymo began Level 4 services
     • 2024: Tesla released FSD v12 with an end-to-end system
  18. DARPA Grand Challenge (2004-2007)

     A competition for autonomous vehicles organized by the U.S. DARPA.
     • 2004 Grand Challenge ◦ 240 km course in the Mojave Desert ◦ No team finished the race (the best vehicle covered only about 12 km)
     • 2005 Grand Challenge ◦ 212 km off-road course ◦ 5 teams completed the course
     • 2007 Urban Challenge ◦ 96 km course designed to simulate urban environments
     → Participants later went on to Waymo, Zoox, Argo, Nuro, Aurora, etc.
     (Photo: the vehicle from CMU that won the 2007 DARPA Urban Challenge. [robot.watch.impress.co.jp/cda/news/2007/11/08/733.html])
  19. LiDAR + HD mapping technology (2010~)

     Used for advanced autonomous driving at Level 3 or 4: LiDAR sensors are combined with high-precision 3D maps. → High cost of map creation and sensors. (Figures: a high-precision 3D map; point cloud data captured by LiDAR sensors.)
  20. LiDAR-based autonomous driving

     Inputs: images, point clouds, HD maps (https://paperswithcode.com/dataset/nuscenes)
     • Perception: object recognition, sign recognition, lane recognition
     • Prediction: motion prediction, future map prediction, traffic agents
     • Planning: search problems, route planning
     • Control: control algorithms
     Modules operate independently by function → difficult to achieve overall optimization (a sketch of this modular stack follows below).
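     A sketch of why joint optimization is hard in the classical stack; the interfaces below are illustrative, not any specific vendor's. Each stage is developed and tuned independently against its own metric, so there is no single objective the whole pipeline can be trained to optimize.

```python
# Sketch of the classical modular stack (interfaces are illustrative).
# Each stage is developed independently, so errors compound across stages
# and there is no end-to-end objective for the pipeline as a whole.
def perception(image, point_cloud, hd_map):   # -> detected objects, signs, lanes
    ...

def prediction(objects):                      # -> future trajectories of traffic agents
    ...

def planning(predictions, hd_map):            # -> ego route / path plan
    ...

def control(plan):                            # -> steering / throttle / brake commands
    ...

def drive_one_step(image, point_cloud, hd_map):
    objects = perception(image, point_cloud, hd_map)
    futures = prediction(objects)
    plan = planning(futures, hd_map)
    return control(plan)
```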
  21. The Rise of Deep Learning (2012~)

     Starting with image recognition, DNNs became the mainstream approach.
     • Image recognition (2012) ◦ AlexNet dominated the ImageNet image-recognition competition. ◦ It is the root of modern convolutional neural networks.
     • Defeated the world champion in Go (2016) ◦ DeepMind's AlphaGo surpassed human performance. ◦ Demonstrated effectiveness on intellectual tasks.
     (Photos: In 2017, Ke Jie played against AlphaGo [www.youtube.com/watch?v=1U1p4Mwis60]; the roots of CNNs: AlexNet's architecture [Krizhevsky+ 2017].)
  22. DAVE-2 [Bojarski+ 2016]

     • NVIDIA developed an automotive SoC capable of running CNNs at 30 fps, enabling autonomous driving (a rough sketch of the steering network follows below).
     • Collected 72 hours of driving data and successfully drove 10 miles hands-free.
     (Figure: overview of the data collection system; NVIDIA Drive PX, 24 TOPS. www.youtube.com/watch?v=NJU9ULQUwng)
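     A rough PilotNet-style sketch of the kind of network DAVE-2 uses; the layer sizes approximate the published architecture and preprocessing/normalization are omitted, so treat it as illustrative rather than an exact reproduction. A small CNN maps a front-camera frame directly to a steering command and is trained by regression against the human driver's steering angle (behavior cloning).

```python
# Rough PilotNet-style network in the spirit of DAVE-2 (layer sizes approximate
# the published architecture; preprocessing and normalization are omitted).
import torch
import torch.nn as nn

class SteeringNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, 3), nn.ReLU(),
            nn.Conv2d(64, 64, 3), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 1 * 18, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 10), nn.ReLU(),
            nn.Linear(10, 1),                # predicted steering command
        )

    def forward(self, frame):                # frame: (batch, 3, 66, 200)
        return self.head(self.features(frame))

# Trained by regressing against the recorded human steering angle.
steering = SteeringNet()(torch.randn(1, 3, 66, 200))
```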
  23. End-to-end Autonomous Driving

     An end-to-end model outputs the driving path directly from images: multi-camera images → neural network → vehicle trajectory.
     In contrast, the classical stack processes inputs such as sensors and high-precision maps in separate modules: images / point clouds / HD maps → Perception (object, sign, and lane recognition) → Prediction (motion prediction, future map prediction, traffic agents) → Planning (search problems, route planning) → Control (control algorithms).
  24. UniAD [Hu+ 2023]

     An end-to-end framework that learns vehicle control using only cameras, optimizing all modules jointly. Selected as the CVPR 2023 Best Paper.
  25. Tesla FSD v12~

     Tesla's latest autonomous driving system, deployed in the US. It transitioned to an end-to-end architecture in v12, replacing roughly 300,000 lines of hand-written code. The car naturally avoids puddles even though it was never explicitly trained to do so. [x.com/AIDRIVR/status/1760841783708418094?s=20]
  26. Gen-3 Autonomous Driving Tasks (2023~)

     Autonomous driving research is shifting toward natural-language situational understanding with generative AI. [Li+ 2024]
     • Gen 1 (CNN, 2012~): front camera, LiDAR (e.g., KITTI [Geiger+ 2012])
     • Gen 2 (Transformer, 2019~): multi-camera, LiDAR, radar, HD maps (e.g., nuScenes [Caesar+ 2019])
     • Gen 3 (LLM, 2023~): multi-camera, language (e.g., DriveLM [Sima+ 2023])
  27. Contents • Multimodal Models ◦ Trends in recent large-scale models

    • Autonomous Driving Technology ◦ The DARPA Challenge and its legacy • Multimodal × Autonomous Driving ◦ Applications of multimodal AI centered around LLMs 31
  28. Complex Traffic Scene Understanding

     Examples: understanding text-based signs • pedestrian or traffic controller? • traffic controllers and traffic signals • traffic area restrictions. Humans can instantly understand the "context" of such scenes.
  29. LLM in Vehicle [Tanahashi+ 2023]

     Pioneered "LLM in Vehicle", using LLMs to directly control cars (Jun. 2023).
     • Object detection + GPT-4 + control (a hypothetical sketch of the loop follows below).
     • Handles complex instructions and decisions ◦ "Go to the cone that is the same color as a banana." ◦ "Turning right causes an accident involving one person, while turning left involves five."
     (Photo: the LLM in Vehicle demo car.)
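     A hypothetical sketch of the "object detection + GPT-4 + control" loop described on the slide. The prompt wording, JSON schema, and helper names below are my own assumptions, not Turing's actual implementation.

```python
# Hypothetical sketch of the "object detection + LLM + control" loop on the slide;
# the prompt, schema, and helper names are assumptions, not Turing's actual code.
import json

def build_prompt(detections, instruction):
    # detections: e.g. [{"label": "traffic cone", "color": "yellow", "bearing_deg": -15}, ...]
    return (
        "You are the decision module of an autonomous car.\n"
        f"Detected objects: {json.dumps(detections)}\n"
        f"Instruction from passenger: {instruction}\n"
        'Reply with JSON: {"target": <object index>, "action": "go" | "stop" | "turn_left" | "turn_right"}'
    )

def decide(llm_reply_text):
    # The downstream controller consumes a small, validated command.
    command = json.loads(llm_reply_text)
    assert command["action"] in {"go", "stop", "turn_left", "turn_right"}
    return command

prompt = build_prompt(
    [{"label": "traffic cone", "color": "yellow", "bearing_deg": -15},
     {"label": "traffic cone", "color": "red", "bearing_deg": 20}],
    "Go to the cone that is the same color as a banana.",
)
# prompt would be sent to GPT-4; decide() parses the reply into a control command.
```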
  30. LingoQA [Marcu+ 2023]

     Uses a VLM to enable situational understanding and driving decision-making within a question-and-answer framework. Marcu, Ana-Maria, et al. "LingoQA: Video question answering for autonomous driving." arXiv preprint arXiv:2312.14115 (2023).
  31. LMDrive [Shao+ 2023]

     Achieved end-to-end driving control using only a language model, enabling driving in a simulator environment.
  32. DriveVLM [Tian+ 2024]

     Performs scene understanding and planning within the language model, in a chain-of-thought (CoT) style, while integrating with existing autonomous driving systems.
  33. RT-2 [Brohan+ 2023]

     Fine-tunes a pretrained VLM with action data from a robot arm, so that the model emits actions as output tokens (a minimal sketch of this action tokenization follows below). Proposed a new paradigm: the Vision-Language-Action (VLA) model. Zitkovich, Brianna, et al. "RT-2: Vision-language-action models transfer web knowledge to robotic control." CoRL 2023.
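     A minimal sketch of the core VLA trick of treating actions as tokens. The bin count and value ranges below are illustrative, not RT-2's exact scheme: continuous actions are discretized into a small vocabulary so the VLM can emit them as ordinary output tokens and the mapping can be inverted at decoding time.

```python
# Minimal sketch of "actions as tokens": continuous actions are discretized into
# a small vocabulary so a VLM can emit them like ordinary text tokens.
# (Bin count and ranges are illustrative, not RT-2's exact scheme.)
import numpy as np

NUM_BINS = 256

def action_to_tokens(action, low, high):
    """Map each action dimension to an integer token id in [0, NUM_BINS - 1]."""
    scaled = (np.asarray(action) - low) / (high - low)        # normalize to [0, 1]
    return np.clip((scaled * (NUM_BINS - 1)).round().astype(int), 0, NUM_BINS - 1)

def tokens_to_action(tokens, low, high):
    """Inverse mapping used when decoding the model's output tokens."""
    return low + (np.asarray(tokens) / (NUM_BINS - 1)) * (high - low)

low, high = np.array([-1.0, -1.0, 0.0]), np.array([1.0, 1.0, 1.0])   # e.g. dx, dy, gripper
tokens = action_to_tokens([0.25, -0.5, 1.0], low, high)              # -> array([159, 64, 255])
recovered = tokens_to_action(tokens, low, high)
```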
  34. CoVLA [Arai+ 2024]

     A comprehensive dataset integrating Vision, Language, and Action (an illustrative sample layout follows below).
     • Vision: 30 s × 10,000 videos, plus radar and sensor signals.
     • Language: frame-level captions, e.g., "The ego vehicle is moving slowly and turning right. There is a traffic light displaying a green signal …" ◦ Scene recognition via a VLM and an object detection model (traffic lights; leading-vehicle position, speed, signal) ◦ Behavior captions from a rule-based algorithm ◦ Reasoning captions.
     • Action: future trajectories reconstructed by sensor fusion from sensor signals and control information (throttle/brake position, steering angle, turn signal).
     Arai, Hidehisa, et al. "CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving." arXiv preprint arXiv:2408.10845 (2024).
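     To make the structure concrete, here is the shape of one frame-level sample as I read the slide. The field names are my own shorthand and the values are invented; this is not CoVLA's actual schema.

```python
# Illustrative frame-level sample (field names are my own shorthand, not
# CoVLA's actual schema; values are invented).
sample = {
    "frame": "video_0001/frame_0123.jpg",                        # hypothetical path
    "caption": "The ego vehicle is moving slowly and turning right. "
               "There is a traffic light displaying a green signal ...",
    "future_trajectory": [[0.0, 0.0], [0.8, 0.1], [1.6, 0.3]],   # ego-frame waypoints (m)
    "control": {"throttle": 0.12, "brake": 0.0, "steering_deg": 8.5, "turn_signal": "right"},
}
```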
  35. CoVLA [Arai+ 2024]

     Takes situational understanding with VLMs a step further by enabling the model to directly output driving actions.
     Ground truth caption: "The ego vehicle is moving straight at a moderate speed following the leading car with acceleration. There is a traffic light near the ego vehicle displaying a green signal. …"
     Predicted caption: "The ego vehicle is moving at a moderate speed and turning right. There is a traffic light near the ego vehicle displaying a green signal. …"
     (Figure: the trajectory predicted by the VLA model vs. the actual trajectory.)
     Arai, Hidehisa, et al. "CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving." arXiv preprint arXiv:2408.10845 (2024).
  36. Roadmap for the VLA model

     1. VLM: build a large-scale dataset and train a state-of-the-art open model.
     2. Driving data: collect and curate 3,000 hours of 3D data.
     3. VLA models: spatial awareness and understanding of the physical world = embodied AI.
     (Heron-VILA-13B, JAVLA-Dataset)
  37. GameNGen [Valevski+ 2024]

     Builds a real-time world model (a playable game engine) using diffusion models. Valevski, Dani, et al. "Diffusion Models Are Real-Time Game Engines." arXiv preprint arXiv:2408.14837 (2024). https://www.youtube.com/watch?v=O3616ZFGpqw
  38. World Model [Ha+ 2018]

     A model that constructs internal representations to understand, predict, and learn from the surrounding environment (an internal model of the world). Example: the V-M-C (Vision, Memory, Controller) model [D. Ha+ 2018]; an abstract mental model of oneself riding a bicycle.
  39. GAIA-1 [Hu+ 2023]

     A world model for autonomous driving that predicts driving states and generates future visuals.
     • Extended to multimodal capabilities, including language and video. ◦ Converts videos into discrete tokens so that a Transformer can process them like language tokens (a toy sketch follows below).
     (Figure: GAIA-1 action conditioning.)
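     A toy sketch of the idea on this slide, not GAIA-1's actual architecture: frames become discrete tokens (e.g., via a VQ tokenizer like the one sketched earlier) and a decoder-only Transformer predicts the next frame's tokens conditioned on past frames and actions. Vocabulary sizes and dimensions below are illustrative.

```python
# Toy sketch of a GAIA-1-style world model: discrete video tokens plus action
# tokens feed a causal Transformer that predicts the next frame's tokens.
# (Sizes are illustrative only; this is not the actual GAIA-1 architecture.)
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    def __init__(self, image_vocab=1024, action_vocab=64, dim=256, num_layers=4):
        super().__init__()
        self.image_embed = nn.Embedding(image_vocab, dim)
        self.action_embed = nn.Embedding(action_vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)   # used with a causal mask
        self.head = nn.Linear(dim, image_vocab)                    # next-image-token logits

    def forward(self, image_tokens, action_tokens):
        # Simplified conditioning: concatenate one frame's tokens with one action token.
        x = torch.cat([self.image_embed(image_tokens), self.action_embed(action_tokens)], dim=1)
        # Causal (upper-triangular) mask so each position only attends to the past.
        mask = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf")), diagonal=1)
        return self.head(self.backbone(x, mask=mask))

logits = TinyWorldModel()(torch.randint(0, 1024, (1, 256)), torch.randint(0, 64, (1, 1)))
```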
  40. Terra [Arai+ 2024]

     • Can generate outputs conditioned on any specified driving route.
     • Exhibits very high instruction-following capability.
     (Figure: current scene + driving route.)
  41. Contents • Multimodal Models ◦ Trends in recent large-scale models

    • Autonomous Driving Technology ◦ The DARPA Challenge and its legacy • Multimodal × Autonomous Driving ◦ Applications of multimodal AI centered around LLMs 47
  42. “The fundamental approach surpasses traditional systems.”

    The Shogi AI “Ponanza,” developed by CEO Yamamoto, improved through machine learning at a pace surpassing rule-based systems. In 2017, it became the first in Japan to defeat a reigning Shogi Grandmaster.
    (Photo: CEO Yamamoto and the Shogi AI “Ponanza”. Chart: performance vs. technological progress, with the AI model growing exponentially and the rule-based model growing linearly; “Today” is marked on the axis.)