
re:Invent 2023: CMP332 De-mystifying ML software stack on Amazon EC2 accelerated instances

Amazon EC2 offers the broadest set of accelerators in the cloud for running machine learning workloads. Whether you are using Amazon EC2 instances powered by NVIDIA GPUs, AWS Trainium, or AWS Inferentia, you may wonder how to manage the software stack. What goes in the Amazon Machine Image (AMI)? What goes in the container? How does that differ on Kubernetes with Amazon EKS versus on Slurm with AWS ParallelCluster? How do you make use of the Deep Learning AMI (DLAMI)? In this chalk talk, we dive into the software stack for different accelerators, services, and containerization systems, covering AMIs and practical techniques, drawn from hands-on experience, to build and manage your software stack.

Keita Watanabe

February 20, 2024


Transcript

  1. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. De-mystifying ML software stack on Amazon EC2 accelerated instances (CMP332)
     Keita Watanabe, Ph.D., Sr. Specialist Solutions Architect, WWSO GenAI - Frameworks, Amazon Web Services
     Pierre-Yves Aquilanti, Ph.D., Head Frameworks Solutions, WWSO GenAI - Frameworks, Amazon Web Services
  3. Broad and deep accelerated computing portfolio: GPU, AWS ML accelerator, and FPGA-based EC2 instances.
     GPUs (H100, A100, V100, A10G, T4, Radeon): P5, P4d, P4de, P3, G5, G5g, G4
     AI/ML accelerators and ASICs (Trainium, Inferentia, Gaudi, Graviton CPU): Trn1, Inf1, Inf2 (preview), DL1
     FPGAs (Xilinx FPGA and accelerators): F1, VT1
  4. Fundamental architectures for training and inference (hardware components: GPU/Trainium/EFA)
     Training: instances in a placement group within a single Availability Zone, with Amazon FSx for Lustre mounted on /scratch (see the placement-group sketch after this list).
     • Instances: P5, P4d(e), Trn1, G5
     • Scale: POC = 1-64 instances, PROD = 4-100s
     • Care for: EFA, EC2 capacity, shared network
     • Cost objective: cost-to-train ($/iteration)
     • Tightly coupled, communication heavy, and inter-node latency sensitive
     Inference: EKS node groups spread across Availability Zones 1 and 2.
     • Instances: G5, G4dn, Inf1, Inf2, CPU-based instances
     • Scale: POC = 1-64 instances, PROD = 1-1000s
     • Care for: scaling latency (predictive, metric, capacity)
     • Cost objective: serving at scale and fast, $/inference
     • Loosely coupled, fast scaling in/out, and query latency sensitive
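For the tightly coupled training side, instances are typically launched into a cluster placement group. A quick sketch with the AWS CLI; the group name and AMI ID are placeholders:

    # Create a cluster placement group, then launch training instances into it.
    aws ec2 create-placement-group --group-name ml-train-pg --strategy cluster
    aws ec2 run-instances --instance-type p4d.24xlarge --count 8 \
      --image-id ami-01234567890abcdef \
      --placement GroupName=ml-train-pg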
  5. HPC Cluster - AWS ParallelCluster: head node and compute fleet in a private subnet (us-east-1), users connecting through a public subnet, Amazon FSx for Lustre mounted on /fsx.
     Head node: 1 × c5.9xlarge, 36 vCPUs (18 physical), 72 GB of memory
     Compute nodes: 100+ × p4de.24xlarge (plus C6, M6, R6), 96 vCPUs (48 physical), 1152 GB of memory, 8 × NVIDIA A100 80 GB GPUs, 400 Gbps networking (ENA & EFA), storage: 8 × 1 TB NVMe + EBS
     Shared file system: Amazon FSx for Lustre, 108 TB on /fsx
     Cluster stack: Slurm 22.05.5, CUDA 11.6
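A minimal sketch of what such a cluster looks like as a ParallelCluster 3.x configuration; the subnet IDs, key name, and counts are placeholders, and the schema should be checked against the ParallelCluster docs for your version:

    # Write a minimal cluster config, then create the cluster with the pcluster CLI.
    cat > cluster.yaml <<'EOF'
    Region: us-east-1
    Image:
      Os: ubuntu2004
    HeadNode:
      InstanceType: c5.9xlarge
      Networking:
        SubnetId: subnet-aaaa1111          # public subnet (placeholder)
      Ssh:
        KeyName: my-key                    # placeholder
    Scheduling:
      Scheduler: slurm
      SlurmQueues:
        - Name: compute
          ComputeResources:
            - Name: p4de
              InstanceType: p4de.24xlarge
              MinCount: 0
              MaxCount: 100
              Efa:
                Enabled: true
          Networking:
            SubnetId: subnet-bbbb2222      # private subnet (placeholder)
            PlacementGroup:
              Enabled: true
    SharedStorage:
      - MountDir: /fsx
        Name: fsx
        StorageType: FsxLustre
        FsxLustreSettings:
          StorageCapacity: 108000          # ~108 TB, in GiB
    EOF
    pcluster create-cluster --cluster-name ml-cluster --cluster-configuration cluster.yaml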
  6. Resiliency in Amazon SageMaker HyperPod: agents monitor cluster instance health for CPU, GPU, and network issues. Once an agent detects a hardware failure, SageMaker HyperPod automatically replaces the faulty instance with a healthy one. With the faulty instance replaced, SageMaker HyperPod then requeues the workload in Slurm and reloads the last valid checkpoint to resume processing.
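A sketch of how a job opts into this behavior, assuming HyperPod's Slurm auto-resume integration; the script name, node count, and checkpoint path are placeholders, and the training script itself must reload the latest checkpoint on restart:

    cat > train.sbatch <<'EOF'
    #!/bin/bash
    #SBATCH --job-name=llm-train
    #SBATCH --nodes=16
    #SBATCH --exclusive

    # --auto-resume=1 lets HyperPod requeue the step after a faulty node
    # is replaced; the script resumes from the last checkpoint on /fsx.
    srun --auto-resume=1 python train.py --checkpoint-dir /fsx/checkpoints
    EOF
    sbatch train.sbatch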
  7. Deep learning open-source training stack with EFA (GPU). Stack layers: launcher, training framework, frameworks & optimization libraries, communication libraries, driver, and hardware.
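To make the layering concrete, a minimal single-node launch sketch assuming PyTorch with NCCL and the EFA Libfabric provider; the script name is a placeholder:

    # Launcher (torchrun) -> framework (PyTorch) -> communication (NCCL over
    # Libfabric/EFA via aws-ofi-nccl) -> driver/hardware (NVIDIA GPUs + EFA).
    export FI_PROVIDER=efa          # select the EFA Libfabric provider
    export NCCL_DEBUG=INFO          # log which NCCL transport/plugin is picked up
    torchrun --nnodes=1 --nproc_per_node=8 train.py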
  8. Deep learning open-source training stack with EFA (Neuron). Stack layers: frameworks & optimization libraries, communication libraries, driver, and hardware.
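The Neuron stack is exercised the same way. A minimal sketch for a trn1.32xlarge node, assuming torch-neuronx with the XLA backend; the script name is a placeholder:

    # Launcher (torchrun) -> framework (PyTorch Neuron / torch-neuronx via XLA)
    # -> collectives (aws-neuron-collectives over EFA) -> Neuron driver/devices.
    export FI_PROVIDER=efa
    torchrun --nnodes=1 --nproc_per_node=32 train.py   # 32 NeuronCores on trn1.32xlarge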
  9. Containers & AMI - how we generally think of it. Container (pulled from a container registry): framework, libraries (Lib A, Lib B, Lib C), Python, software packages. AMI: Docker, telemetry, software packages, operating system.
  10. Containers & AMI - how we generally think of it (continued). The same split applies across EFA-connected nodes: each node runs an AMI (Docker, telemetry, software packages, operating system) hosting a container (framework, Lib A/B/C, Python, software packages) pulled from a container registry, with EFA linking the nodes.
  11. What should be on the AMI? (GPU)
     • NVIDIA GPU driver - docker run: add "--gpus all" to have nvidia-smi inside the container
     • NVIDIA Fabric Manager
     • NVIDIA Docker - the service is the Docker daemon; the CLI is invoked by users
     • EFA driver - install with --minimal (note: the AMI won't even have fi_info); docker run: add "--device /dev/infiniband/uverbs0 ..."
     • SSM agent
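Putting those flags together, a minimal sketch of running a training container on such an AMI; the image name is a placeholder:

    # Expose the GPUs and the EFA device to the container.
    # Instances such as p4d expose several uverbs devices; pass each one.
    docker run --rm \
      --gpus all \
      --device /dev/infiniband/uverbs0 \
      my-training-image:latest nvidia-smi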
  12. What should be in the container? (GPU)
     • libfabric-aws - install with "--skip-kmod --skip-limit-conf"; AWS also provides a prebuilt Open MPI
     • CUDA
     • cuDNN - PyTorch backend
     • cuBLAS - PyTorch backend
     • NCCL - PyTorch backend; can be overridden using LD_PRELOAD=.../libnccl.so
     • aws-ofi-nccl - Libfabric plugin for NCCL
     • ML frameworks - PyTorch/TensorFlow/JAX
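A few checks one can run inside the container to confirm the stack is wired up; the libnccl.so path in the override example is a placeholder:

    fi_info -p efa                       # Libfabric sees the EFA provider
    python -c "import torch; print(torch.cuda.nccl.version())"   # NCCL bundled with PyTorch
    # Override the bundled NCCL with a custom build if needed:
    LD_PRELOAD=/opt/nccl/build/lib/libnccl.so python train.py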
  13. What should be on the AMI? (Neuron)
     • Neuron driver - docker run: add "--device=/dev/neuron0" to have neuron-ls inside the container
     • Docker - the service is the Docker daemon; the CLI is invoked by users
     • EFA driver - install with --minimal (note: the AMI won't even have fi_info); docker run: add "--device /dev/infiniband/uverbs0 ..."
     • SSM agent
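A minimal sketch of exposing Neuron devices to a container on such an AMI; the image name is a placeholder, and multi-chip instances expose /dev/neuron1, /dev/neuron2, ... as well:

    # Expose one Neuron device plus the EFA device to the container.
    docker run --rm \
      --device=/dev/neuron0 \
      --device /dev/infiniband/uverbs0 \
      my-neuron-image:latest neuron-ls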
  14. What should be in the container? (Neuron)
     • libfabric-aws - install with "--skip-kmod --skip-limit-conf"; AWS also provides a prebuilt Open MPI
     • aws-neuron(x)-runtime-lib
     • aws-neuron(x)-tools - neuron-ls / neuron-top
     • neuron-compiler - run neuron-cc from within a machine learning framework
     • PyTorch Neuron - torch-neuronx (Inf2 & Trn1/Trn1n) / torch-neuron (Inf1)
     • aws-neuron-collectives - collective operations with the Neuron SDK
     • Distributed training libraries - neuronx-nemo-megatron / neuronx-distributed
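A hedged sketch of installing the PyTorch Neuron pieces into a container image, assuming the Neuron pip repository documented by the Neuron SDK:

    # Point pip at the AWS Neuron package repository and install torch-neuronx.
    pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
    pip install torch-neuronx neuronx-cc neuronx-distributed
    neuron-ls   # verify the tools see the devices (requires --device=/dev/neuron0)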
  15. AWS Deep Learning AMIs (DLAMI)
     • Preconfigured with popular deep learning frameworks and interfaces
     • Optimized for performance with the latest NVIDIA driver, CUDA libraries, and Intel libraries
     Choosing a DLAMI:
     • Deep Learning AMI with Conda - dedicated Conda environment for each framework
     • Deep Learning Base AMI - no frameworks
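One way to locate the latest Base DLAMI, sketched with describe-images; the name filter is an assumption, so adjust it to the DLAMI variant you want:

    # Find the newest Deep Learning Base GPU AMI for Ubuntu 20.04 (name pattern assumed).
    aws ec2 describe-images --owners amazon \
      --filters "Name=name,Values=Deep Learning Base GPU AMI (Ubuntu 20.04)*" \
      --query 'sort_by(Images, &CreationDate)[-1].{Id:ImageId,Name:Name}' \
      --output table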
  16. AWS Deep Learning Containers
     • Prepackaged ML framework container images, fully configured and validated
     • Includes AWS optimizations for TensorFlow, PyTorch, MXNet, and Hugging Face
     github.com/aws/deep-learning-containers
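Deep Learning Containers are pulled from AWS-managed ECR registries. A sketch using the us-east-1 registry account from the DLC documentation; the exact image tag is a placeholder, so check the repository's available_images list:

    # Log in to the Deep Learning Containers registry and pull a PyTorch training image.
    aws ecr get-login-password --region us-east-1 | \
      docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
    docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2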
  17. Thank you! Please complete the session survey in the mobile app.
     Keita Watanabe, Ph.D. - [email protected]
     Pierre-Yves Aquilanti, Ph.D. - [email protected]