
Trends in Multimodal Models and Autonomous Driving

Yu Yamaguchi
January 14, 2025

Lecture material for the January 15, 2025 class.

Advances in generative AI, centered on large language models (LLMs) and diffusion models, have rapidly raised interest in multimodal technology. In particular, multimodal foundation models that jointly process modalities such as vision, language, and action are driving innovation in robotics and autonomous driving. This lecture surveys the technical trends surrounding multimodal foundation models and introduces recent applications, with a focus on autonomous driving.


Transcript

  1. About Me Yu Yamaguchi CTO / Director of AI, Turing

    Inc. • Former researcher at AIST and NIST, developing AI for Go and Shogi. • Joined Turing Inc. in 2022 as a founding member after serving as an executive officer at a public company. • Leads AI research for autonomous driving. 3
  2. Turing Inc.

     Overview • Founded: August 2021 • CEO: Issei Yamamoto • Total funding: $50MM • Employees: 50+
     Business • Development of fully autonomous vehicles, aiming to achieve this through generative AI.
  3. Contents • Multimodal Models ◦ Trends in recent large-scale models

    • Autonomous Driving Technology ◦ The DARPA Challenge and its legacy • Multimodal × Autonomous Driving ◦ Applications of multimodal AI centered around LLMs 5
  4. Contents • Multimodal Models ◦ Trends in recent large-scale models

    • Autonomous Driving Technology ◦ The DARPA Challenge and its legacy • Multimodal × Autonomous Driving ◦ Applications of multimodal AI centered around LLMs 6
  5. Multimodal Models

     • LLMs as the core of cognition ◦ Since CLIP [Radford et al., 2021], there has been significant progress in technologies that connect specific modalities with language models. ◦ Building on a pretrained LLM greatly reduces training costs. (Figure: representative multimodal models [Zhang et al., 2024])
  6. Mechanism of Multimodal Models

     Typical pipeline (recreated from [Zhang+ 2024] Fig. 2):
     • Inputs (text / image / video / audio) → modality encoders: NFNet-F6, ViT, CLIP ViT, Eva-CLIP ViT, C-Former, HuBERT, BEATs, ...
     • Input projector (adapter): linear projector, MLP, cross-attention, Q-Former, P-Former, MQ-Former, ... → multimodal understanding
     • LLM backbone: Flan-T5, UL2, Qwen, OPT, LLaMA, LLaMA-2, Vicuna, ...
     • Output projector: tiny Transformer, MLP, ...
     • Generators: Stable Diffusion, Zeroscope, AudioLDM, ... → image / video / audio outputs (multimodal generation)
  7. Vision-Language Models (VLMs) The mainstream approach connects a pretrained LLM

    with a vision encoder using an adapter. Wang, Jianfeng, et al. "Git: A generative image-to-text transformer for vision and language." arXiv preprint arXiv:2205.14100 (2022). 10
  8. How to Tokenize Image Features

     Two common designs (a minimal sketch of the first follows below):
     • Using a projector: image encoder → feature vectors → projector (MLP) → image tokens, fed to the Transformer alongside language tokens. e.g., GIT [Wang+], LLaVA [Liu+]
     • Using cross-attention: image encoder → adapter; the Transformer attends to image features via cross-attention and special tokens. e.g., BLIP-2 [Li+], Flamingo [Alayrac+]
     Alayrac, Jean-Baptiste, et al. "Flamingo: a visual language model for few-shot learning." NeurIPS 2022. Li, Junnan, et al. "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models." ICML 2023. Wang, Jianfeng, et al. "GIT: A generative image-to-text transformer for vision and language." arXiv preprint arXiv:2205.14100 (2022). Liu, Haotian, et al. "Visual instruction tuning." NeurIPS 2024.
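     As a rough illustration of the projector route, here is a minimal PyTorch sketch; the dimensions and layer count are assumptions for illustration, not the exact configuration of GIT or LLaVA. Features from a frozen vision encoder are mapped by an MLP into the LLM's embedding space and prepended to the text embeddings.

```python
# Minimal sketch of the "projector" approach (illustrative sizes, not GIT/LLaVA's exact config).
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Two-layer MLP mapping vision features into the LLM embedding space.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):        # (batch, num_patches, vision_dim)
        return self.proj(patch_features)      # (batch, num_patches, llm_dim) "image tokens"

# Usage: prepend the projected image tokens to the text token embeddings and
# feed the combined sequence to the (frozen or lightly tuned) LLM.
batch, num_patches, vision_dim, llm_dim = 2, 256, 1024, 4096
patch_features = torch.randn(batch, num_patches, vision_dim)   # from a frozen ViT/CLIP encoder
text_embeds = torch.randn(batch, 32, llm_dim)                  # from the LLM's embedding table
image_tokens = MLPProjector(vision_dim, llm_dim)(patch_features)
llm_inputs = torch.cat([image_tokens, text_embeds], dim=1)     # (batch, 256 + 32, llm_dim)
```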
  9. Flamingo [Alayrac+ 2022]

     A model that processes images, videos, and text simultaneously, enabling few-shot learning.
     • Image encoder + LLM ◦ Pretrained CLIP-style vision encoder and Chinchilla [Hoffmann+ 2022] ◦ Adds and trains gated cross-attention layers as the projector (see the sketch below). ◦ Efficiently converts images and videos into fixed-length tokens using a Perceiver [Jaegle+ 2021] resampler.
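     A simplified reading of the tanh-gated cross-attention idea, not the official implementation and with illustrative dimensions: text tokens attend to visual tokens, and the result is added through a learnable gate initialized at zero, so the frozen pretrained LM is undisturbed at the start of training.

```python
# Simplified tanh-gated cross-attention block in the spirit of Flamingo
# (not the official implementation; dimensions are illustrative).
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=4096, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # The gate starts at 0, so tanh(gate) = 0 and the block is initially an identity.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        # Text queries attend over visual keys/values.
        attended, _ = self.attn(self.norm(text_tokens), visual_tokens, visual_tokens)
        return text_tokens + torch.tanh(self.gate) * attended

# These trainable blocks are interleaved with the frozen LM layers;
# visual_tokens would come from the Perceiver resampler.
x = GatedCrossAttention()(torch.randn(1, 16, 4096), torch.randn(1, 64, 4096))
```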
  10. LLaVA [Liu+ 2023]

     Achieves high performance through high-quality instruction-tuning data for image-language tasks.
     • Instruction-tuning data ◦ A large amount of data was generated from the COCO dataset using GPT-4 (an illustrative sample format follows below).
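     For intuition, here is what one visual instruction-tuning sample can look like. The content below is made up and the field names only mirror the commonly used LLaVA-style conversation format; it is not taken from the actual dataset.

```python
# Illustrative shape of a visual instruction-tuning sample (made-up content;
# field names mirror the commonly used LLaVA-style conversation format).
sample = {
    "image": "coco/train2017/000000123456.jpg",   # hypothetical file name
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this scene?"},
        {"from": "gpt",   "value": "A man is ironing clothes on a board attached to a moving taxi."},
    ],
}
```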
  11. Heron [Tanahashi+ 2023]

     • Adds visual modules to pretrained LLMs in arbitrary combinations.
     • Trained a 73B-parameter vision-language model.
  12. Heron [Tanahashi+ 2023] (demo output)

     Sample caption generated in the demo (translated from Japanese): "In the image, a yellow taxi is stopped on the street, and a man wearing a yellow shirt is sitting on top of it. On the back of the taxi he is doing various tasks, such as ironing and folding laundry. The interesting point of this scene is the presence of the iron placed on top of the taxi." * Image used in the demo is from the GPT-4 technical report.
  13. How to Create “Image Tokens”

     • VQ-VAE [Van Den Oord+ 2017] • TiTok [Yu+ 2024] (a minimal sketch of the quantization step follows below)
     Van Den Oord, Aaron, and Oriol Vinyals. "Neural discrete representation learning." NeurIPS 2017. Yu, Qihang, et al. "An Image is Worth 32 Tokens for Reconstruction and Generation." NeurIPS 2024. https://openreview.net/forum?id=tOXoQPRzPL
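     A minimal sketch of the vector-quantization step at the heart of VQ-VAE, for illustration only: the straight-through gradient and the commitment/codebook losses are omitted, and the sizes are arbitrary. Each encoder output vector is replaced by the index of its nearest codebook entry, which is what yields discrete "image tokens".

```python
# Minimal vector-quantization step (the core of VQ-VAE); illustrative only:
# the straight-through estimator and commitment/codebook losses are omitted.
import torch

def quantize(z, codebook):
    """z: (batch, num_vectors, dim) encoder outputs; codebook: (K, dim)."""
    # L2 distance from every latent vector to every codebook entry.
    dists = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))
    indices = dists.argmin(dim=-1)          # discrete "image tokens", (batch, num_vectors)
    z_q = codebook[indices]                 # quantized latents, (batch, num_vectors, dim)
    return indices, z_q

codebook = torch.randn(1024, 64)            # K = 1024 codes of dimension 64
tokens, z_q = quantize(torch.randn(2, 256, 64), codebook)   # 2 images, 16x16 latent grid
```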
  14. Interleaved Text-and-Image Generation

     Consistently understands and generates data in which text and images are interleaved. Chameleon [Chameleon Team 2024] Team, Chameleon. "Chameleon: Mixed-modal early-fusion foundation models." arXiv preprint arXiv:2405.09818 (2024).
  15. Contents • Multimodal Models ◦ Trends in recent large-scale models

    • Autonomous Driving Technology ◦ The DARPA Challenge and its legacy • Multimodal × Autonomous Driving ◦ Applications of multimodal AI centered around LLMs 19
  16. Levels of Autonomous Driving

     • Level 0: No autonomous driving
     • Level 1: Accelerator/brake OR steering wheel
     • Level 2: Accelerator/brake AND steering wheel
     • Level 3: System drives in specific conditions (driver required)
     • Level 4: System drives in specific conditions
     • Level 5: Fully autonomous driving
     Notes: the lower levels are equipped in many commercial vehicles (e.g., cruise control); some commercial services exist at the higher levels; Level 5 has yet to be achieved.
  17. History of Autonomous Driving (2004~2024)

     • 2004: The DARPA Grand Challenge
     • 2007: CMU won the DARPA Urban Challenge
     • 2009: Google X self-driving car project
     • 2011: Nevada authorized self-driving cars on public roads
     • 2014: Tesla began developing Autopilot
     • 2015: SAE defined the autonomous driving levels
     • 2018: Waymo launched a commercial self-driving taxi service
     • 2020: Honda launched a Level 3 autonomous vehicle
     • 2021: Waymo began Level 4 services
     • 2024: Tesla released FSD v12 with an end-to-end system
  18. DARPA Grand Challenge (2004-2007)

     A competition for autonomous vehicles organized by the U.S. DARPA.
     • 2004 Grand Challenge ◦ 240 km course in the Mojave Desert ◦ No team finished the race (the best vehicle covered only about 12 km)
     • 2005 Grand Challenge ◦ 212 km off-road course ◦ 5 teams completed the course
     • 2007 Urban Challenge ◦ 96 km course designed to simulate urban environments
     → Participants later went on to Waymo, Zoox, Argo, Nuro, Aurora, etc.
     (Photo: the vehicle from CMU that won the 2007 DARPA Urban Challenge. [robot.watch.impress.co.jp/cda/news/2007/11/08/733.html])
  19. LiDAR + HD mapping technology (2010~)

     Used for advanced autonomous driving at Level 3 or 4: LiDAR sensors are combined with high-precision 3D maps. → High cost of map creation and sensors. (Figures: a high-precision 3D map; point cloud data captured by LiDAR sensors.)
  20. LiDAR-based autonomous driving

     Inputs: images, point clouds, HD maps (https://paperswithcode.com/dataset/nuscenes)
     • Perception: object recognition, sign recognition, lane recognition
     • Prediction: motion prediction, future map prediction, traffic agents
     • Planning: search problems, route planning
     • Control: control algorithms
     Modules operate independently by function → difficult to achieve overall optimization (a sketch of this modular stack follows below).
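     A sketch of why joint optimization is hard in the classical stack; the interfaces below are illustrative, not any specific vendor's. Each stage is developed and tuned independently against its own metric, so there is no single objective the whole pipeline can be trained to optimize.

```python
# Sketch of the classical modular stack (interfaces are illustrative).
# Each stage is developed independently, so errors compound across stages
# and there is no end-to-end objective for the pipeline as a whole.
def perception(image, point_cloud, hd_map):   # -> detected objects, signs, lanes
    ...

def prediction(objects):                      # -> future trajectories of traffic agents
    ...

def planning(predictions, hd_map):            # -> ego route / path plan
    ...

def control(plan):                            # -> steering / throttle / brake commands
    ...

def drive_one_step(image, point_cloud, hd_map):
    objects = perception(image, point_cloud, hd_map)
    futures = prediction(objects)
    plan = planning(futures, hd_map)
    return control(plan)
```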
  21. The Rise of Deep Learning (2012~)

     Starting with image recognition, DNNs became the mainstream approach.
     • Image recognition (2012) ◦ AlexNet dominated the ImageNet image-recognition competition. ◦ It is the root of modern convolutional neural networks.
     • Defeated the world champion in Go (2016) ◦ DeepMind's AlphaGo surpassed human performance. ◦ Demonstrated effectiveness on intellectual tasks.
     (Photos: In 2017, Ke Jie played against AlphaGo [www.youtube.com/watch?v=1U1p4Mwis60]; the roots of CNNs: AlexNet's architecture [Krizhevsky+ 2017].)
  22. DAVE-2 [Bojarski+ 2016]

     • NVIDIA developed an automotive SoC capable of running CNNs at 30 fps, enabling autonomous driving (a rough sketch of the steering network follows below).
     • Collected 72 hours of driving data and successfully drove 10 miles hands-free.
     (Figure: overview of the data collection system; NVIDIA Drive PX, 24 TOPS. www.youtube.com/watch?v=NJU9ULQUwng)
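     A rough PilotNet-style sketch of the kind of network DAVE-2 uses; the layer sizes approximate the published architecture and preprocessing/normalization are omitted, so treat it as illustrative rather than an exact reproduction. A small CNN maps a front-camera frame directly to a steering command and is trained by regression against the human driver's steering angle (behavior cloning).

```python
# Rough PilotNet-style network in the spirit of DAVE-2 (layer sizes approximate
# the published architecture; preprocessing and normalization are omitted).
import torch
import torch.nn as nn

class SteeringNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, 3), nn.ReLU(),
            nn.Conv2d(64, 64, 3), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 1 * 18, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 10), nn.ReLU(),
            nn.Linear(10, 1),                # predicted steering command
        )

    def forward(self, frame):                # frame: (batch, 3, 66, 200)
        return self.head(self.features(frame))

# Trained by regressing against the recorded human steering angle.
steering = SteeringNet()(torch.randn(1, 3, 66, 200))
```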
  23. End-to-end Autonomous Driving

     An end-to-end model outputs the driving path directly from images: multi-camera images → neural network → vehicle trajectory.
     In contrast, the classical stack processes inputs such as sensors and high-precision maps in separate modules: images / point clouds / HD maps → Perception (object, sign, and lane recognition) → Prediction (motion prediction, future map prediction, traffic agents) → Planning (search problems, route planning) → Control (control algorithms).
  24. UniAD [Hu+ 2023]

     An end-to-end framework that learns vehicle control using only cameras, optimizing all modules jointly. Selected as the CVPR 2023 Best Paper.
  25. Tesla FSD v12~

     Tesla's latest autonomous driving system, deployed in the US. It transitioned to an end-to-end architecture in v12, replacing roughly 300,000 lines of hand-written code. The car naturally avoids puddles even though it was never explicitly trained to do so. [x.com/AIDRIVR/status/1760841783708418094?s=20]
  26. Gen-3 Autonomous Driving Tasks (2023~)

     Autonomous driving research is shifting toward natural-language situational understanding with generative AI. [Li+ 2024]
     • Gen 1 (CNN, 2012~): front camera, LiDAR (e.g., KITTI [Geiger+ 2012])
     • Gen 2 (Transformer, 2019~): multi-camera, LiDAR, radar, HD maps (e.g., nuScenes [Caesar+ 2019])
     • Gen 3 (LLM, 2023~): multi-camera, language (e.g., DriveLM [Sima+ 2023])
  27. Contents • Multimodal Models ◦ Trends in recent large-scale models

    • Autonomous Driving Technology ◦ The DARPA Challenge and its legacy • Multimodal × Autonomous Driving ◦ Applications of multimodal AI centered around LLMs 31
  28. Complex Traffic Scene Understanding

     Examples: understanding text-based signs • pedestrian or traffic controller? • traffic controllers and traffic signals • traffic area restrictions. Humans can instantly understand the "context" of such scenes.
  29. LLM in Vehicle [Tanahashi+ 2023]

     Pioneered "LLM in Vehicle", using LLMs to directly control cars (Jun. 2023).
     • Object detection + GPT-4 + control (a hypothetical sketch of the loop follows below).
     • Handles complex instructions and decisions ◦ "Go to the cone that is the same color as a banana." ◦ "Turning right causes an accident involving one person, while turning left involves five."
     (Photo: the LLM in Vehicle demo car.)
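     A hypothetical sketch of the "object detection + GPT-4 + control" loop described on the slide. The prompt wording, JSON schema, and helper names below are my own assumptions, not Turing's actual implementation.

```python
# Hypothetical sketch of the "object detection + LLM + control" loop on the slide;
# the prompt, schema, and helper names are assumptions, not Turing's actual code.
import json

def build_prompt(detections, instruction):
    # detections: e.g. [{"label": "traffic cone", "color": "yellow", "bearing_deg": -15}, ...]
    return (
        "You are the decision module of an autonomous car.\n"
        f"Detected objects: {json.dumps(detections)}\n"
        f"Instruction from passenger: {instruction}\n"
        'Reply with JSON: {"target": <object index>, "action": "go" | "stop" | "turn_left" | "turn_right"}'
    )

def decide(llm_reply_text):
    # The downstream controller consumes a small, validated command.
    command = json.loads(llm_reply_text)
    assert command["action"] in {"go", "stop", "turn_left", "turn_right"}
    return command

prompt = build_prompt(
    [{"label": "traffic cone", "color": "yellow", "bearing_deg": -15},
     {"label": "traffic cone", "color": "red", "bearing_deg": 20}],
    "Go to the cone that is the same color as a banana.",
)
# prompt would be sent to GPT-4; decide() parses the reply into a control command.
```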
  30. LingoQA [Marcu+ 2023]

     Uses a VLM to enable situational understanding and driving decision-making within a question-and-answer framework. Marcu, Ana-Maria, et al. "LingoQA: Video question answering for autonomous driving." arXiv preprint arXiv:2312.14115 (2023).
  31. LMDrive [Shao+ 2023]

     Achieved end-to-end driving control using only a language model, enabling driving in a simulator environment.
  32. DriveVLM [Tian+ 2024]

     Performs scene understanding and planning within the language model, in a chain-of-thought (CoT) style, while integrating with existing autonomous driving systems.
  33. RT-2 [Brohan+ 2023]

     Fine-tunes a pretrained VLM with action data from a robot arm, so that the model emits actions as output tokens (a minimal sketch of this action tokenization follows below). Proposed a new paradigm: the Vision-Language-Action (VLA) model. Zitkovich, Brianna, et al. "RT-2: Vision-language-action models transfer web knowledge to robotic control." CoRL 2023.
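     A minimal sketch of the core VLA trick of treating actions as tokens. The bin count and value ranges below are illustrative, not RT-2's exact scheme: continuous actions are discretized into a small vocabulary so the VLM can emit them as ordinary output tokens and the mapping can be inverted at decoding time.

```python
# Minimal sketch of "actions as tokens": continuous actions are discretized into
# a small vocabulary so a VLM can emit them like ordinary text tokens.
# (Bin count and ranges are illustrative, not RT-2's exact scheme.)
import numpy as np

NUM_BINS = 256

def action_to_tokens(action, low, high):
    """Map each action dimension to an integer token id in [0, NUM_BINS - 1]."""
    scaled = (np.asarray(action) - low) / (high - low)        # normalize to [0, 1]
    return np.clip((scaled * (NUM_BINS - 1)).round().astype(int), 0, NUM_BINS - 1)

def tokens_to_action(tokens, low, high):
    """Inverse mapping used when decoding the model's output tokens."""
    return low + (np.asarray(tokens) / (NUM_BINS - 1)) * (high - low)

low, high = np.array([-1.0, -1.0, 0.0]), np.array([1.0, 1.0, 1.0])   # e.g. dx, dy, gripper
tokens = action_to_tokens([0.25, -0.5, 1.0], low, high)              # -> array([159, 64, 255])
recovered = tokens_to_action(tokens, low, high)
```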
  34. CoVLA [Arai+ 2024]

     A comprehensive dataset integrating Vision, Language, and Action (an illustrative sample layout follows below).
     • Vision: 30 s × 10,000 videos, plus radar and sensor signals.
     • Language: frame-level captions, e.g., "The ego vehicle is moving slowly and turning right. There is a traffic light displaying a green signal …" ◦ Scene recognition via a VLM and an object detection model (traffic lights; leading-vehicle position, speed, signal) ◦ Behavior captions from a rule-based algorithm ◦ Reasoning captions.
     • Action: future trajectories reconstructed by sensor fusion from sensor signals and control information (throttle/brake position, steering angle, turn signal).
     Arai, Hidehisa, et al. "CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving." arXiv preprint arXiv:2408.10845 (2024).
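     To make the structure concrete, here is the shape of one frame-level sample as I read the slide. The field names are my own shorthand and the values are invented; this is not CoVLA's actual schema.

```python
# Illustrative frame-level sample (field names are my own shorthand, not
# CoVLA's actual schema; values are invented).
sample = {
    "frame": "video_0001/frame_0123.jpg",                        # hypothetical path
    "caption": "The ego vehicle is moving slowly and turning right. "
               "There is a traffic light displaying a green signal ...",
    "future_trajectory": [[0.0, 0.0], [0.8, 0.1], [1.6, 0.3]],   # ego-frame waypoints (m)
    "control": {"throttle": 0.12, "brake": 0.0, "steering_deg": 8.5, "turn_signal": "right"},
}
```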
  35. CoVLA [Arai+ 2024]

     Takes situational understanding with VLMs a step further by enabling the model to directly output driving actions.
     Ground truth caption: "The ego vehicle is moving straight at a moderate speed following the leading car with acceleration. There is a traffic light near the ego vehicle displaying a green signal. …"
     Predicted caption: "The ego vehicle is moving at a moderate speed and turning right. There is a traffic light near the ego vehicle displaying a green signal. …"
     (Figure: the trajectory predicted by the VLA model vs. the actual trajectory.)
     Arai, Hidehisa, et al. "CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving." arXiv preprint arXiv:2408.10845 (2024).
  36. Roadmap for the VLA model

     1. VLM: build a large-scale dataset and train a state-of-the-art open model.
     2. Driving data: collect and curate 3,000 hours of 3D data.
     3. VLA models: spatial awareness and understanding of the physical world = embodied AI.
     (Heron-VILA-13B, JAVLA-Dataset)
  37. GameNGen [Valevski+ 2024]

     Builds a real-time world model (a playable game engine) using diffusion models. Valevski, Dani, et al. "Diffusion Models Are Real-Time Game Engines." arXiv preprint arXiv:2408.14837 (2024). https://www.youtube.com/watch?v=O3616ZFGpqw
  38. World Model [Ha+ 2018]

     A model that constructs internal representations to understand, predict, and learn from the surrounding environment (an internal model of the world). Example: the V-M-C (Vision, Memory, Controller) model [D. Ha+ 2018]; an abstract mental model of oneself riding a bicycle.
  39. GAIA-1 [Hu+ 2023]

     A world model for autonomous driving that predicts driving states and generates future visuals.
     • Extended to multimodal capabilities, including language and video. ◦ Converts videos into discrete tokens so that a Transformer can process them like language tokens (a toy sketch follows below).
     (Figure: GAIA-1 action conditioning.)
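     A toy sketch of the idea on this slide, not GAIA-1's actual architecture: frames become discrete tokens (e.g., via a VQ tokenizer like the one sketched earlier) and a decoder-only Transformer predicts the next frame's tokens conditioned on past frames and actions. Vocabulary sizes and dimensions below are illustrative.

```python
# Toy sketch of a GAIA-1-style world model: discrete video tokens plus action
# tokens feed a causal Transformer that predicts the next frame's tokens.
# (Sizes are illustrative only; this is not the actual GAIA-1 architecture.)
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    def __init__(self, image_vocab=1024, action_vocab=64, dim=256, num_layers=4):
        super().__init__()
        self.image_embed = nn.Embedding(image_vocab, dim)
        self.action_embed = nn.Embedding(action_vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)   # used with a causal mask
        self.head = nn.Linear(dim, image_vocab)                    # next-image-token logits

    def forward(self, image_tokens, action_tokens):
        # Simplified conditioning: concatenate one frame's tokens with one action token.
        x = torch.cat([self.image_embed(image_tokens), self.action_embed(action_tokens)], dim=1)
        # Causal (upper-triangular) mask so each position only attends to the past.
        mask = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf")), diagonal=1)
        return self.head(self.backbone(x, mask=mask))

logits = TinyWorldModel()(torch.randint(0, 1024, (1, 256)), torch.randint(0, 64, (1, 1)))
```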
  40. Terra [Arai+ 2024]

     • Can generate outputs conditioned on any specified driving route.
     • Exhibits very high instruction-following capability.
     (Figure: current scene + driving route.)
  41. Contents • Multimodal Models ◦ Trends in recent large-scale models

    • Autonomous Driving Technology ◦ The DARPA Challenge and its legacy • Multimodal × Autonomous Driving ◦ Applications of multimodal AI centered around LLMs 47
  42. “The fundamental approach surpasses traditional systems.”

    The Shogi AI “Ponanza,” developed by CEO Yamamoto, improved through machine learning at a pace surpassing rule-based systems. In 2017, it became the first in Japan to defeat a reigning Shogi Grandmaster.
    (Photo: CEO Yamamoto and the Shogi AI “Ponanza”. Chart: performance vs. technological progress, with the AI model growing exponentially and the rule-based model growing linearly; “Today” is marked on the axis.)