

AWS Summit New York 2024: CMP 301 Demystifying the ML software stack on Amazon EC2 accelerated instances

Amazon EC2 offers the most extensive range of accelerators in the cloud for running machine learning workloads. Whether you use EC2 instances powered by NVIDIA GPUs or AWS Trainium, managing the software stack can be challenging. Key considerations include what belongs in the Amazon Machine Image (AMI) and what belongs in the container.
In this chalk talk, we explore the software stack for various accelerators, diving into AMIs and practical techniques for building and managing your software stack. The session draws on real-world experience to help you optimize your machine learning infrastructure on AWS.

Keita Watanabe

July 03, 2024

Transcript

  1. © 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
     NEW YORK CITY | JULY 10, 2024
  2. CMP 301: Demystifying the ML software stack on Amazon EC2 accelerated instances
     Keita Watanabe, Ph.D., Senior Specialist Solutions Architect, WWSO GenAI - Frameworks, Amazon Web Services
     Ankur Srivastava, Ph.D., Senior Specialist Solutions Architect, WWSO GenAI - Frameworks, Amazon Web Services
  3. Outline
     I. Training and inference compute and software stacks
     II. Container and AMI: Where the wild things are
     III. How about the AWS Deep Learning AMIs (DLAMI) and AWS Deep Learning Containers
     IV. Diving into Amazon EKS and AWS ParallelCluster (Slurm)
     V. Wrap up
  4. Broad and deep accelerated computing portfolio (machine learning)
     Training: P4d (NVIDIA A100 Tensor Core), P4de (NVIDIA A100 Tensor Core), P5 (NVIDIA H100 Tensor Core), DL1 (Intel Habana Gaudi), Trn1n (AWS Trainium), Trn2 (AWS Trainium2)
     Inference: G5 (NVIDIA A10G Tensor Core GPUs), G6 (NVIDIA L4 Tensor Core), G6e (NVIDIA L40S Tensor Core), Inf1 (AWS Inferentia), Inf2 (AWS Inferentia2), DL2q (Qualcomm AI 100)
     (Several of these instance types are flagged NEW on the slide.)
  5. Interconnect matters
     Fully sharded data parallel: faster AI workloads with fewer accelerators.
     [Figure: two accelerators (GPU 0 and GPU 1), each holding a model shard. Per training step: all-gather weights → forward (local) → all-gather weights → backward (local) → reduce-scatter to sync gradients → update weights (local).]
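The per-step flow in this figure is what PyTorch's FullyShardedDataParallel implements. A minimal sketch (not from the deck), assuming a torchrun launch with one process per GPU:

```python
# Minimal FSDP sketch. Assumes torchrun sets RANK/WORLD_SIZE/MASTER_ADDR
# and that each rank owns one GPU. Illustrative, not a tuned training loop.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")                      # GPU collectives via NCCL
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).cuda()

# FSDP keeps only a shard of the weights per rank; each step it all-gathers
# weights for forward/backward, then reduce-scatters gradients before updating.
model = FSDP(model)
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).square().mean()
loss.backward()
optim.step()
dist.destroy_process_group()
```

Launch with something like `torchrun --nproc_per_node=8 fsdp_min.py`; the all-gather and reduce-scatter arrows in the figure happen inside FSDP's forward and backward hooks.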
  6. Distributed training architecture and AWS services
     Architecture: tightly coupled, communication heavy, and sensitive to inter-node latency. Instances sit in a placement group within a single Availability Zone, networked with Elastic Fabric Adapter and backed by Amazon FSx for Lustre (/fsx) shared storage.
     AWS services: Amazon EKS (Kubernetes clusters), AWS ParallelCluster (Slurm HPC clusters), Amazon SageMaker HyperPod (resilient and persistent clusters).
  7. High-speed networking: Elastic Fabric Adapter
     Elastic Fabric Adapter (EFA) is an OS-bypass, RDMA-enabled network adapter for high-performance inter-node communication. It uses an AWS-designed transport named Scalable Reliable Datagram (SRD).
     [Figure: "SRD: How it works" - multiple parallel flows (Flow 1-3) between endpoint A and endpoint B.]
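NCCL reaches EFA through the aws-ofi-nccl (libfabric) plugin, so applications usually only set environment variables. A hedged sketch; the variables below are commonly documented but version- and instance-dependent, so verify them against the EFA cheatsheet mentioned later in this deck:

```python
# Hedged sketch: typical environment for NCCL-over-EFA, set before the
# process group (and thus NCCL) initializes. Launch under torchrun/srun.
import os
import torch.distributed as dist

os.environ.setdefault("FI_PROVIDER", "efa")           # select the EFA libfabric provider
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")  # GPUDirect RDMA (P4d/P5-era setting)
os.environ.setdefault("NCCL_DEBUG", "INFO")           # logs show whether EFA/OFI was picked

dist.init_process_group("nccl")
# In the NCCL INFO output, look for the libfabric/OFI transport being selected.
dist.destroy_process_group()
```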
  8. Distributed training software stack (GPU)
     Layers, top to bottom: ML frameworks → communication libraries and SDKs → hardware drivers → EC2 instance.
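Each layer of this stack can be inspected from inside the framework; a small illustrative check (not from the deck):

```python
# Walk the GPU stack from the top: framework -> libraries/SDKs -> driver/device.
import torch

print("PyTorch:", torch.__version__)                   # ML framework layer
print("CUDA toolkit (built against):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())        # library / SDK layer
if torch.cuda.is_available():                          # driver + hardware layer
    print("NCCL:", torch.cuda.nccl.version())          # communication library
    print("Device:", torch.cuda.get_device_name(0))
```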
  9. Distributed training software stack (Neuron)
     Layers, top to bottom: ML frameworks → communication libraries and SDKs → hardware drivers → EC2 instance.
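On Trainium the same layering holds, with torch-neuronx (built on PyTorch/XLA) at the framework layer and the Neuron runtime and driver below it. A hedged smoke test, assuming a Trn1/Inf2 instance with the Neuron SDK installed; the API surface can shift between Neuron releases:

```python
# Hedged Neuron smoke test: run one matmul on a NeuronCore via PyTorch/XLA.
import torch
import torch_xla.core.xla_model as xm  # ships with torch-neuronx installs

device = xm.xla_device()         # resolves to a NeuronCore on Trn1/Inf2
x = torch.randn(2, 2).to(device)
y = (x @ x).cpu()                # .cpu() forces execution on the device
print("Ran a matmul on:", device)
```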
  10. What is on the AMI? (GPU)
      On the GPU side, the AMI carries the hardware drivers and the container toolkits; it is the layer sitting directly on the EC2 instance.
  11. What is in the container? (GPU)
      The container holds the ML frameworks and the communication libraries and SDKs, and runs on top of the AMI on the EC2 instance.
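A practical consequence of this split: inside a running container, the driver version is the host AMI's (surfaced by the NVIDIA Container Toolkit), while the CUDA toolkit version belongs to the image. A hedged check:

```python
# Hedged check of the AMI/container boundary on a GPU host. nvidia-smi and the
# driver are injected from the host AMI by the NVIDIA Container Toolkit; the
# CUDA toolkit and framework come from the container image itself.
import subprocess
import torch

driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print("Driver (from the host AMI):", driver)
print("CUDA toolkit (from the container image):", torch.version.cuda)
```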
  12. What is on the AMI? (Neuron)
      For Neuron, the AMI carries the hardware drivers, the Neuron SDK, the container toolkits, and aws-neuronx-oci-hook.
  13. What is in the container? (Neuron)
      As on the GPU side, the container holds the ML frameworks and the communication libraries and SDKs.
  14. Best practices for large-scale distributed training
      Step-by-step guides to create clusters:
      • One-click VPC deployments
      • Mounting FSx for Lustre filesystems
      • EFA-enabled clusters
      Recipes to customize AMIs; AWS-optimized Dockerfiles; EFA cheatsheet
      Validation (NCCL tests, etc.; see the sketch after this slide), observability (Prometheus/Grafana, etc.), profiling (Nsight product family)
      Distributed training examples:
      • Slurm scripts and Kubernetes materials
      • Working with Pyxis/Enroot
      • NeMo (Megatron-LM, Multimodal, BioNeMo)
      • MosaicML
      • DDP, FSDP
      • SMDP, SMMP
      • TensorFlow/JAX
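As a stand-in for the NCCL tests mentioned in the validation bullet, here is a minimal all-reduce sanity check; illustrative only, use the official nccl-tests for real benchmarking:

```python
# Minimal all-reduce timing across all ranks. Launch with, e.g.:
#   torchrun --nproc_per_node=<gpus> allreduce_check.py
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

t = torch.ones(256 * 1024 * 1024 // 4, device="cuda")  # ~256 MB of fp32
torch.cuda.synchronize()
start = time.perf_counter()
dist.all_reduce(t)                                      # sum across all ranks
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

if dist.get_rank() == 0:
    print(f"all_reduce of 256 MB took {elapsed * 1000:.1f} ms")
dist.destroy_process_group()
```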
  15. Thank you! Please complete the session survey in the mobile app.
      Keita Watanabe, Ph.D. [email protected]
      Ankur Srivastava, Ph.D. [email protected]
  16. Anatomy of GPU Stacks
  17. AWS Deep Learning AMIs (DLAMI)
      Preconfigured with popular deep learning frameworks and interfaces; optimized for performance with the latest NVIDIA drivers, CUDA libraries, and Intel libraries.
      Choosing a DLAMI:
      • Deep Learning AMI with Conda – dedicated Conda environment for each framework
      • Deep Learning Base AMI – no frameworks
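Finding a current DLAMI programmatically; a hedged boto3 sketch, where the name filter is an assumption to be matched against the DLAMI release notes:

```python
# Hedged sketch: locate the newest Amazon-owned AMI whose name matches the
# Deep Learning Base AMI family. The "Deep Learning Base*" filter is an
# assumption; check the DLAMI release notes for exact image names.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
images = ec2.describe_images(
    Owners=["amazon"],
    Filters=[{"Name": "name", "Values": ["Deep Learning Base*"]}],
)["Images"]
latest = max(images, key=lambda img: img["CreationDate"])
print(latest["Name"], latest["ImageId"])
```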
  18. AWS Deep Learning Containers
      Prepackaged ML framework container images, fully configured and validated. Include AWS optimizations for TensorFlow, PyTorch, MXNet, and Hugging Face.
      github.com/aws/deep-learning-containers
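Pulling one of these images; a hedged sketch where the registry account and image tag are illustrative examples, to be taken from available_images.md in the repository above:

```python
# Hedged sketch: authenticate to the DLC registry and pull a PyTorch training
# image. Account ID and tag are illustrative examples; verify both against
# available_images.md in github.com/aws/deep-learning-containers.
import subprocess

region = "us-west-2"
registry = f"763104351884.dkr.ecr.{region}.amazonaws.com"  # DLC registry per repo docs
image = f"{registry}/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-ec2"

token = subprocess.run(
    ["aws", "ecr", "get-login-password", "--region", region],
    capture_output=True, text=True, check=True,
).stdout
subprocess.run(
    ["docker", "login", "--username", "AWS", "--password-stdin", registry],
    input=token, text=True, check=True,
)
subprocess.run(["docker", "pull", image], check=True)
```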
  19. Container structure and storage
      Prefer Amazon ECR or Amazon S3 for AWS Batch
      • DockerHub throttles under load, and private registries can suffer too
      [Diagram: a scientist builds from GitHub/DockerHub and pushes to an in-region (us-west-2) registry that AWS Batch pulls from.]
      Fewer, evenly sized layers are better
      • Docker requests layers in parallel (1 layer = 1 request)
      • Evenly sized layers give a better load distribution
  20. Using machine images and containers
      Data type              Size          Change frequency   Lives in
      Input data             20 GB (r+w)   20 min per job     Runtime
      Configurations         3 MB          Weekly             Container
      Application            1 GB          5 min              Container
      Application libraries  4 GB          Weekly             Machine image
      Core dependencies      5 GB          Biweekly           Machine image
      Operating system       500 MB        Monthly            Machine image