This session explores how Amazon SageMaker HyperPod provides scalable, cost-efficient infrastructure for large-scale foundation model training. Drawing on lessons learned while training Amazon's Nova models, we examine the technical architecture of HyperPod, including core infrastructure components built on Amazon EC2 UltraClusters such as Elastic Fabric Adapter (EFA) and Amazon FSx for Lustre, along with optimized training frameworks and automated fault-recovery mechanisms designed to maximize performance and minimize downtime. Attendees will gain insights into how HyperPod enables high-throughput, resilient distributed training across thousands of GPUs, helping organizations reduce time-to-train and simplify operational complexity. The session also highlights a case study of Llama 3.3 Swallow, a 70B-parameter Japanese language model developed by the Institute of Science Tokyo, showing how HyperPod can support the development of sovereign and regionally optimized foundation models.
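
As a rough illustration of the kind of provisioning step the session builds on, the sketch below creates a small HyperPod cluster with the boto3 SageMaker `create_cluster` API. The instance type, instance count, S3 lifecycle-script location, and IAM role ARN are placeholder assumptions for illustration only, not values from the session or from the Nova or Swallow training setups.

```python
# Minimal sketch: provisioning a SageMaker HyperPod cluster with boto3.
# All names, counts, paths, and ARNs below are illustrative placeholders.
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-west-2")

response = sagemaker.create_cluster(
    ClusterName="fm-training-cluster",
    InstanceGroups=[
        {
            "InstanceGroupName": "gpu-workers",
            "InstanceType": "ml.p5.48xlarge",  # GPU instance type; placeholder choice
            "InstanceCount": 16,               # scale out toward larger clusters as needed
            "LifeCycleConfig": {
                # Lifecycle scripts staged in S3 (e.g., scheduler and environment setup);
                # the bucket path and script name are hypothetical.
                "SourceS3Uri": "s3://my-bucket/hyperpod/lifecycle-scripts/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
            "ThreadsPerCore": 1,
        },
    ],
)
print(response["ClusterArn"])
```

Once a cluster like this is running, HyperPod's health monitoring and automated node replacement handle the fault-recovery behavior the abstract refers to, so long-running training jobs can resume from their latest checkpoint rather than restarting from scratch.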