References: Robotics FM

1. The Fusion of Foundation Models and Robots: How Will Multimodal AI Change Robots? (基盤モデルとロボットの融合 マルチモーダルAIでロボットはどう変わるのか, KS Science and Engineering Books), Kento Kawaharazuka and Tatsuya Matsushima (authors) https://www.amazon.co.jp/dp/4065395852
2. Robotics Foundation Models (RFMs)
   1. R3M https://arxiv.org/abs/2203.12601 - ResNet-based visual representation trained on egocentric video data
   2. MVP https://arxiv.org/abs/2210.03109 - Vision Transformer (ViT)-based
   3. Visual Cortex-1 https://arxiv.org/abs/2303.18240 - ViT-based
   4. PaLM-E https://arxiv.org/abs/2303.03378 - embodied multimodal language model that feeds sensor embeddings (e.g., images encoded with a ViT) into PaLM alongside text tokens
   5. RoboVQA https://arxiv.org/abs/2311.00899 - post-training of the VideoCoCa (383M) vision-language model (VLM) on annotated robot action data
   6. MT-Opt https://arxiv.org/abs/2104.08212 - reinforcement learning (RL) on robot arm data
   7. Robotics Transformer (RT)-1 https://arxiv.org/abs/2212.06817, RT-2, and RT-X https://arxiv.org/abs/2310.08864 with the Open X-Embodiment (OXE) dataset https://robotics-transformer-x.github.io
   8. RT-Trajectory, RT-Sketch, AutoRT
   9. Octo https://arxiv.org/abs/2405.12213 - modular generalist policy trained on the OXE (RT-X) dataset
   10. OpenVLA https://arxiv.org/abs/2406.09246 - open Vision-Language-Action (VLA) model combining ViT encoders (DINOv2 + SigLIP) with Llama 2 7B, trained on the OXE dataset (see the usage sketch after this list)
   11. RDT-1B https://arxiv.org/abs/2410.07864 - diffusion-based
   12. π0 https://arxiv.org/abs/2410.24164v1, π0.5 https://arxiv.org/abs/2504.16054 - by Physical Intelligence (PI)
   13. NoMaD
   14. In-context Robot Transformer (ICRT) https://arxiv.org/abs/2408.15980
   15. GraspVLA https://arxiv.org/abs/2505.03233 - pre-trained on billion-scale synthetic data
3. Datasets
   1. Bridge v2 https://rail-berkeley.github.io/bridgedata/
   2. OXE https://robotics-transformer-x.github.io/
   3. DROID https://droid-dataset.github.io
4. Data Capture Systems
   1. ALOHA
   2. GELLO
   3. UMI
   4. Dobb-E
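As a minimal, hands-on pointer for one of the open models above, the sketch below queries an OpenVLA checkpoint for a single action given a camera image and a language instruction, following the usage pattern described on the openvla/openvla-7b Hugging Face model card. The predict_action helper and the unnorm_key argument come from the model's remote code, the image path is a stand-in for a live camera frame, and downstream robot execution is omitted; treat this as an illustrative sketch under those assumptions, not a verified recipe.

```python
# Sketch: single-step OpenVLA inference (assumes GPU, transformers, and the
# openvla/openvla-7b checkpoint; predict_action/unnorm_key are provided by the
# model's remote code per its model card, so trust_remote_code=True is required).
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

instruction = "pick up the red block"
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

# Placeholder image; in practice this would be the robot's current camera frame.
image = Image.open("camera_frame.png")
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)

# Returns a 7-D end-effector action (position delta, rotation delta, gripper),
# un-normalized with the statistics of the chosen training dataset split.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)
```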