

AWS Summit New York 2024: CMP 301 Demystifying the ML software stack on Amazon EC2 accelerated instances

Amazon EC2 offers the most extensive range of accelerators in the cloud for running machine learning workloads. Whether you use EC2 instances powered by NVIDIA GPUs or AWS Trainium, managing the software stack can be challenging. Key considerations include what belongs in the Amazon Machine Image (AMI) and what belongs in the container.
In this chalk talk, we explore the software stack for various accelerators, diving into AMIs and practical techniques for building and managing your software stack. The session draws on real-world experience to help you optimize your machine learning infrastructure on AWS.

Keita Watanabe

July 03, 2024

Transcript

  1. © 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
     NEW YORK CITY | JULY 10, 2024
  2. CMP 301: Demystifying the ML software stack on Amazon EC2 accelerated instances
     Keita Watanabe, Ph.D., Senior Specialist Solutions Architect, WWSO GenAI - Frameworks, Amazon Web Services
     Ankur Srivastava, Ph.D., Senior Specialist Solutions Architect, WWSO GenAI - Frameworks, Amazon Web Services
  3. Outline
     I. Training and inference compute and software stacks
     II. Container and AMI: Where the wild things are
     III. How about the AWS Deep Learning AMIs (DLAMI) and AWS Deep Learning Containers
     IV. Diving into Amazon EKS and AWS ParallelCluster (Slurm)
     V. Wrap up
  4. Broad and deep accelerated computing portfolio (machine learning)
     Training: P4d (NVIDIA A100 Tensor Core), P4de (NVIDIA A100 Tensor Core), P5 (NVIDIA H100 Tensor Core), DL1 (Intel Habana Gaudi), Trn1n (AWS Trainium), Trn2 (AWS Trainium2)
     Inference: G5 (NVIDIA A10G Tensor Core GPUs), G6 (NVIDIA L4 Tensor Core), G6e (NVIDIA L40S Tensor Core), Inf1 (AWS Inferentia), Inf2 (AWS Inferentia2), DL2q (Qualcomm AI 100)
     (Several of these instance types are flagged NEW on the slide.)
  5. Interconnect matters
     Fully sharded data parallel: faster AI workloads with fewer accelerators.
     [Figure: two accelerators (GPU 0 and GPU 1), each holding a model shard. Per training step: all-gather weights → forward (local) → all-gather weights → backward (local) → reduce-scatter to sync gradients → update weights (local).]
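The per-step flow in this figure is what PyTorch's FullyShardedDataParallel implements. A minimal sketch (not from the deck), assuming a torchrun launch with one process per GPU:

```python
# Minimal FSDP sketch. Assumes torchrun sets RANK/WORLD_SIZE/MASTER_ADDR
# and that each rank owns one GPU. Illustrative, not a tuned training loop.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")                      # GPU collectives via NCCL
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).cuda()

# FSDP keeps only a shard of the weights per rank; each step it all-gathers
# weights for forward/backward, then reduce-scatters gradients before updating.
model = FSDP(model)
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).square().mean()
loss.backward()
optim.step()
dist.destroy_process_group()
```

Launch with something like `torchrun --nproc_per_node=8 fsdp_min.py`; the all-gather and reduce-scatter arrows in the figure happen inside FSDP's forward and backward hooks.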
  6. Distributed training architecture and AWS services
     Architecture: tightly coupled, communication heavy, and sensitive to inter-node latency. Instances sit in a placement group within a single Availability Zone, networked with Elastic Fabric Adapter and backed by Amazon FSx for Lustre (/fsx) shared storage.
     AWS services: Amazon EKS (Kubernetes clusters), AWS ParallelCluster (Slurm HPC clusters), Amazon SageMaker HyperPod (resilient and persistent clusters).
  7. High-speed networking: Elastic Fabric Adapter
     Elastic Fabric Adapter (EFA) is an OS-bypass, RDMA-enabled network adapter for high-performance inter-node communication. It uses an AWS-designed transport named Scalable Reliable Datagram (SRD).
     [Figure: "SRD: How it works" - multiple parallel flows (Flow 1-3) between endpoint A and endpoint B.]
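NCCL reaches EFA through the aws-ofi-nccl (libfabric) plugin, so applications usually only set environment variables. A hedged sketch; the variables below are commonly documented but version- and instance-dependent, so verify them against the EFA cheatsheet mentioned later in this deck:

```python
# Hedged sketch: typical environment for NCCL-over-EFA, set before the
# process group (and thus NCCL) initializes. Launch under torchrun/srun.
import os
import torch.distributed as dist

os.environ.setdefault("FI_PROVIDER", "efa")           # select the EFA libfabric provider
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")  # GPUDirect RDMA (P4d/P5-era setting)
os.environ.setdefault("NCCL_DEBUG", "INFO")           # logs show whether EFA/OFI was picked

dist.init_process_group("nccl")
# In the NCCL INFO output, look for the libfabric/OFI transport being selected.
dist.destroy_process_group()
```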
  8. Distributed training software stack (GPU)
     Layers, top to bottom: ML frameworks → communication libraries and SDKs → hardware drivers → EC2 instance.
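Each layer of this stack can be inspected from inside the framework; a small illustrative check (not from the deck):

```python
# Walk the GPU stack from the top: framework -> libraries/SDKs -> driver/device.
import torch

print("PyTorch:", torch.__version__)                   # ML framework layer
print("CUDA toolkit (built against):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())        # library / SDK layer
if torch.cuda.is_available():                          # driver + hardware layer
    print("NCCL:", torch.cuda.nccl.version())          # communication library
    print("Device:", torch.cuda.get_device_name(0))
```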
  9. Distributed training software stack (Neuron)
     Layers, top to bottom: ML frameworks → communication libraries and SDKs → hardware drivers → EC2 instance.
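On Trainium the same layering holds, with torch-neuronx (built on PyTorch/XLA) at the framework layer and the Neuron runtime and driver below it. A hedged smoke test, assuming a Trn1/Inf2 instance with the Neuron SDK installed; the API surface can shift between Neuron releases:

```python
# Hedged Neuron smoke test: run one matmul on a NeuronCore via PyTorch/XLA.
import torch
import torch_xla.core.xla_model as xm  # ships with torch-neuronx installs

device = xm.xla_device()         # resolves to a NeuronCore on Trn1/Inf2
x = torch.randn(2, 2).to(device)
y = (x @ x).cpu()                # .cpu() forces execution on the device
print("Ran a matmul on:", device)
```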
  10. What is on the AMI? (GPU)
      On the GPU side, the AMI carries the hardware drivers and the container toolkits; it is the layer sitting directly on the EC2 instance.
  11. What is in the container? (GPU)
      The container holds the ML frameworks and the communication libraries and SDKs, and runs on top of the AMI on the EC2 instance.
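A practical consequence of this split: inside a running container, the driver version is the host AMI's (surfaced by the NVIDIA Container Toolkit), while the CUDA toolkit version belongs to the image. A hedged check:

```python
# Hedged check of the AMI/container boundary on a GPU host. nvidia-smi and the
# driver are injected from the host AMI by the NVIDIA Container Toolkit; the
# CUDA toolkit and framework come from the container image itself.
import subprocess
import torch

driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print("Driver (from the host AMI):", driver)
print("CUDA toolkit (from the container image):", torch.version.cuda)
```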
  12. What is on the AMI? (Neuron)
      For Neuron, the AMI carries the hardware drivers, the Neuron SDK, the container toolkits, and aws-neuronx-oci-hook.
  13. What is in the container? (Neuron)
      As on the GPU side, the container holds the ML frameworks and the communication libraries and SDKs.
  14. Best practices for large-scale distributed training
      Step-by-step guides to create clusters:
      • One-click VPC deployments
      • Mounting FSx for Lustre filesystems
      • EFA-enabled clusters
      Recipes to customize AMIs; AWS-optimized Dockerfiles; EFA cheatsheet
      Validation (NCCL tests, etc.; see the sketch after this slide), observability (Prometheus/Grafana, etc.), profiling (Nsight product family)
      Distributed training examples:
      • Slurm scripts and Kubernetes materials
      • Working with Pyxis/Enroot
      • NeMo (Megatron-LM, Multimodal, BioNeMo)
      • MosaicML
      • DDP, FSDP
      • SMDP, SMMP
      • TensorFlow/JAX
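As a stand-in for the NCCL tests mentioned in the validation bullet, here is a minimal all-reduce sanity check; illustrative only, use the official nccl-tests for real benchmarking:

```python
# Minimal all-reduce timing across all ranks. Launch with, e.g.:
#   torchrun --nproc_per_node=<gpus> allreduce_check.py
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

t = torch.ones(256 * 1024 * 1024 // 4, device="cuda")  # ~256 MB of fp32
torch.cuda.synchronize()
start = time.perf_counter()
dist.all_reduce(t)                                      # sum across all ranks
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

if dist.get_rank() == 0:
    print(f"all_reduce of 256 MB took {elapsed * 1000:.1f} ms")
dist.destroy_process_group()
```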
  15. Thank you! Please complete the session survey in the mobile app.
      Keita Watanabe, Ph.D. [email protected]
      Ankur Srivastava, Ph.D. [email protected]
  16. Anatomy of GPU Stacks
  17. AWS Deep Learning AMIs (DLAMI)
      Preconfigured with popular deep learning frameworks and interfaces; optimized for performance with the latest NVIDIA drivers, CUDA libraries, and Intel libraries.
      Choosing a DLAMI:
      • Deep Learning AMI with Conda – dedicated Conda environment for each framework
      • Deep Learning Base AMI – no frameworks
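Finding a current DLAMI programmatically; a hedged boto3 sketch, where the name filter is an assumption to be matched against the DLAMI release notes:

```python
# Hedged sketch: locate the newest Amazon-owned AMI whose name matches the
# Deep Learning Base AMI family. The "Deep Learning Base*" filter is an
# assumption; check the DLAMI release notes for exact image names.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
images = ec2.describe_images(
    Owners=["amazon"],
    Filters=[{"Name": "name", "Values": ["Deep Learning Base*"]}],
)["Images"]
latest = max(images, key=lambda img: img["CreationDate"])
print(latest["Name"], latest["ImageId"])
```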
  18. AWS Deep Learning Containers
      Prepackaged ML framework container images, fully configured and validated. Include AWS optimizations for TensorFlow, PyTorch, MXNet, and Hugging Face.
      github.com/aws/deep-learning-containers
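Pulling one of these images; a hedged sketch where the registry account and image tag are illustrative examples, to be taken from available_images.md in the repository above:

```python
# Hedged sketch: authenticate to the DLC registry and pull a PyTorch training
# image. Account ID and tag are illustrative examples; verify both against
# available_images.md in github.com/aws/deep-learning-containers.
import subprocess

region = "us-west-2"
registry = f"763104351884.dkr.ecr.{region}.amazonaws.com"  # DLC registry per repo docs
image = f"{registry}/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-ec2"

token = subprocess.run(
    ["aws", "ecr", "get-login-password", "--region", region],
    capture_output=True, text=True, check=True,
).stdout
subprocess.run(
    ["docker", "login", "--username", "AWS", "--password-stdin", registry],
    input=token, text=True, check=True,
)
subprocess.run(["docker", "pull", image], check=True)
```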
  19. Container structure and storage
      Prefer Amazon ECR or Amazon S3 for AWS Batch
      • DockerHub throttles under load, and private registries can suffer too
      [Diagram: a scientist builds from GitHub/DockerHub and pushes to an in-region (us-west-2) registry that AWS Batch pulls from.]
      Fewer, evenly sized layers are better
      • Docker requests layers in parallel (1 layer = 1 request)
      • Evenly sized layers give a better load distribution
  20. Using machine images and containers
      Data type              Size          Change frequency   Lives in
      Input data             20 GB (r+w)   20 min per job     Runtime
      Configurations         3 MB          Weekly             Container
      Application            1 GB          5 min              Container
      Application libraries  4 GB          Weekly             Machine image
      Core dependencies      5 GB          Biweekly           Machine image
      Operating system       500 MB        Monthly            Machine image