This session explores how Amazon SageMaker HyperPod provides scalable, cost-efficient infrastructure for large-scale foundation model training. Drawing on lessons learned while training Amazon's Nova models, we examine the technical architecture of HyperPod, including core infrastructure components built on Amazon EC2 UltraClusters such as Elastic Fabric Adapter (EFA) and Amazon FSx for Lustre, along with optimized training frameworks and automated fault-recovery mechanisms designed to maximize performance and minimize downtime. Attendees will gain insights into how HyperPod enables high-throughput, resilient distributed training across thousands of GPUs, helping organizations reduce time-to-train and simplify operational complexity. The session also highlights a case study of Llama 3.3 Swallow, a 70B-parameter Japanese language model developed by the Institute of Science Tokyo, showing how HyperPod can support the development of sovereign and regionally optimized foundation models.
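
As a rough illustration of the kind of provisioning step the session builds on, the sketch below creates a small HyperPod cluster with the boto3 SageMaker `create_cluster` API. The instance type, instance count, S3 lifecycle-script location, and IAM role ARN are placeholder assumptions for illustration only, not values from the session or from the Nova or Swallow training setups.

```python
# Minimal sketch: provisioning a SageMaker HyperPod cluster with boto3.
# All names, counts, paths, and ARNs below are illustrative placeholders.
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-west-2")

response = sagemaker.create_cluster(
    ClusterName="fm-training-cluster",
    InstanceGroups=[
        {
            "InstanceGroupName": "gpu-workers",
            "InstanceType": "ml.p5.48xlarge",  # GPU instance type; placeholder choice
            "InstanceCount": 16,               # scale out toward larger clusters as needed
            "LifeCycleConfig": {
                # Lifecycle scripts staged in S3 (e.g., scheduler and environment setup);
                # the bucket path and script name are hypothetical.
                "SourceS3Uri": "s3://my-bucket/hyperpod/lifecycle-scripts/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
            "ThreadsPerCore": 1,
        },
    ],
)
print(response["ClusterArn"])
```

Once a cluster like this is running, HyperPod's health monitoring and automated node replacement handle the fault-recovery behavior the abstract refers to, so long-running training jobs can resume from their latest checkpoint rather than restarting from scratch.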