Generative AI FM Training and Deployment with NVIDIA NeMo and NIM on Amazon SageMaker

Title: Efficient Generative AI Development with Amazon SageMaker and NVIDIA

Abstract:
This deck introduces Amazon SageMaker AI and related services for efficient model training and inference with NVIDIA NeMo and NIM on AWS.
====
Event title: NVIDIA × AWS Presents: The Front Lines of AI Model Development - Maximizing AI Training Efficiency with NVIDIA NeMo, NVIDIA NIM, and AWS!

Event overview:
This event focuses on the latest AI model development environments and on optimizing training efficiency.

AWS presents efficient generative AI development using services such as Amazon SageMaker AI, while NVIDIA explains the features and benefits of NVIDIA NeMo and NVIDIA NIM, which accelerate building the latest AI models and applications.

As a practical case study, Stockmark presents model development and accuracy improvement with NVIDIA NeMo, which can also be used in development environments on AWS, including concrete ways to use tools such as NeMo Aligner and Reranker.

Turing then shares hands-on work from autonomous-driving AI development: optimizing GPU resources, designing hybrid environments, and in particular building GPU compute environments for multimodal foundation model development, with a comparison of on-premises and cloud.

A networking reception follows the sessions, giving participants an opportunity to exchange information. For AI developers, this is a valuable chance to learn about the latest technology trends and practical know-how.

Yoshitaka Haribara

May 27, 2025

Transcript

  1. Generative AI FM Training/Deployment with NVIDIA NeMo/NIM on Amazon SageMaker
     Yoshitaka Haribara, Ph.D., Sr. GenAI Startup Solutions Architect, Amazon Web Services Japan G.K.
  2. NVIDIA AI Summit Japan 2024 with Howard Wright (VP, Startups, NVIDIA, ex-AWS)
     Yoshitaka Haribara, Ph.D. (X: @_hariby)
     - Sr. GenAI Startup Solutions Architect, Amazon Web Services Japan G.K.
     - Guest Associate Professor, Center for Quantum Information and Quantum Biology (QIQB), Osaka University
     Background:
     - 2013: B.Sc. in Mathematics, Osaka University (band, drums, applied mathematics and special functions such as Bessel functions)
     - 2018: Ph.D., Graduate School of Information Science and Technology, The University of Tokyo (optical Ising machines, SIMD/MIMD/FPGA, quantized neural networks)
     - 2018: Joined AWS Japan (cloud, machine learning and generative AI, quantum computing, startups)
     Hobbies:
     - Band and drums (YouTube/Instagram: @dr.hariby)
     - Recently started releasing music on streaming services: https://www.tunecore.co.jp/artists/hari-psycho-experience
  3. Agenda
     • AWS GenAI Service Stack (EC2 NVIDIA GPU Instances, SageMaker AI)
     • NeMo on SageMaker HyperPod
     • NIM on SageMaker AI
  4. AWS Generative AI Stack
     • Applications that boost productivity: Amazon Q Business (insights and automation), Amazon Q Developer (software development lifecycle)
     • Models and tools for building generative AI apps: Amazon Bedrock (Amazon models | partner models)
     • Infrastructure for building and training AI models: AWS Trainium, AWS Inferentia, GPUs, high performance computing (HPC), Amazon SageMaker AI (managed infrastructure)
  5. AWS Generative AI Stack (infrastructure layer)
     • Infrastructure for building and training AI models: AWS Trainium, AWS Inferentia, GPUs, high performance computing (HPC), Amazon SageMaker AI (managed infrastructure)
  6. GPU, AWS ML accelerator, and FPGA-based EC2 instances: a broad and deep accelerated computing portfolio (some in preview)
     • NVIDIA GPUs (B200, H200, H100, A100, L4, L40S, A10G, T4): P6-B200, P5en, P5e, P5, P4de, P4d, G6e, G6, G5, G4, GB200
     • AWS ML chips (AWS Trainium, AWS Inferentia): Trn3, Trn2, Trn1, Inf2, Inf1
     • AI/ML accelerators, ASICs, and FPGAs: DL1 (Gaudi accelerator), DL2q (Qualcomm Cloud AI 100), VT1 (Xilinx accelerator), F1/F2 (Xilinx FPGA), Radeon GPU instances
  7. NVIDIA GPU Instances for ML Training
     • P4 (NVIDIA A100): up to 156 teraflops FP64 compute and up to 640 GB HBM2; up to 400 Gbps networking (EFA) and 600 GB/s device-to-device interconnect
     • P5 (NVIDIA H100): up to 536 teraflops FP64 compute and up to 640 GB HBM3; up to 3,200 Gbps networking (EFA) and 900 GB/s device-to-device interconnect
     • P5en (NVIDIA H200): up to 536 teraflops FP64 compute and up to 1,128 GB HBM3e; up to 3,200 Gbps networking (EFA) and 900 GB/s device-to-device interconnect
     • P6-B200 (NVIDIA B200): powered by NVIDIA B200 GPUs; EC2 UltraClusters for accelerating generative AI training and inference at massive scale
  8. NVIDIA GPU Instances for ML Inference
     • G5 (NVIDIA A10G): up to 8 NVIDIA A10G GPUs; up to 192 GB vRAM @ 600 GB/s
     • G6 (NVIDIA L4): up to 8 NVIDIA L4 GPUs; up to 192 GB vRAM @ 300 GB/s
     • G6e (NVIDIA L40S): up to 8 NVIDIA L40S GPUs; up to 384 GB vRAM @ 860 GB/s
  9. Upcoming accelerated computing instances
     • Trn3 (AWS Trainium3): designed to deliver the highest-performance, most energy-efficient AI model training infrastructure in the cloud
     • P* (GB200): featuring GB200 NVL72, with 72 Blackwell GPUs and 36 Grace CPUs interconnected by fifth-generation NVIDIA NVLink™; EC2 UltraClusters connected with Amazon's powerful networking (EFA) and supported by advanced virtualization (AWS Nitro System)
     • DGX Cloud: AI platform co-engineered by AWS and NVIDIA, powered by NVIDIA GB200 superchips; access to the infrastructure and software needed to build and deploy advanced generative AI models on AWS
  10. Second-generation EC2 UltraClusters: the largest-scale ML infrastructure in the cloud
      • Up to 20,000 H200/H100 GPUs (P5) or 100,000 Trainium accelerators (Trn2)
      • 3,200 Gbps Elastic Fabric Adapter (EFA), redesigned for 16x larger scale and lower latency with third-generation EFA
      • Nonblocking petabit-scale network infrastructure
      • Scalable, low-latency, high-throughput storage from Amazon FSx for Lustre (petabytes per second of throughput, billions of IOPS)
  11. Amazon SageMaker (next generation)
      • Unified Studio spanning SQL analytics (Amazon Redshift), data processing (Amazon EMR, AWS Glue, Amazon Athena), model development (Amazon SageMaker AI), gen AI app development (Amazon Bedrock), streaming (Amazon MSK, Amazon Kinesis), business intelligence (Amazon QuickSight), and search analytics (Amazon OpenSearch Service); some components coming soon
      • Built on a lakehouse with data & AI governance
  12. Amazon SageMaker: data and AI governance
      • Governance built in to discover, share, and collaborate on data and AI securely
      • Amazon SageMaker Catalog, built on Amazon DataZone, spanning data, models, and gen AI compute
  13. SageMaker Lakehouse: unify access to all your data
      • Amazon S3 data lakes and Amazon Redshift data warehouses
      • Zero-ETL integrations: Aurora, RDS, DynamoDB, OpenSearch vector data, and SaaS sources (Salesforce, Salesforce Pardot, ServiceNow, Zoho CRM, Zendesk, SAP, Facebook Ads, Instagram Ads)
      • Streaming data: MSK, Kinesis
      • Federated querying plus hundreds of AWS Glue connectors
  14. Purpose-built infrastructure for FM training: two options for model training on SageMaker AI
      • Amazon SageMaker HyperPod: resilient, self-orchestrated infrastructure for maximum resource control; customize and manage cluster orchestration (Slurm or EKS); schedule workloads to maximize cluster utilization across teams
      • Fully managed training jobs: fully managed, fault-tolerant infrastructure for large-scale, cost-effective training; focus on model building rather than infrastructure; access to flexible, on-demand GPU clusters with pay-as-you-go options (a minimal sketch follows below)
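      To make the fully managed option concrete, here is a minimal sketch using the SageMaker Python SDK. The entry-point script, framework/Python versions, hyperparameters, and instance choice are illustrative assumptions, not values from the deck:

          # Minimal sketch of a fully managed SageMaker training job.
          import sagemaker
          from sagemaker.pytorch import PyTorch

          session = sagemaker.Session()
          role = sagemaker.get_execution_role()  # IAM role for the training job

          estimator = PyTorch(
              entry_point="train.py",            # hypothetical training script
              role=role,
              framework_version="2.4",           # illustrative versions
              py_version="py311",
              instance_type="ml.p5.48xlarge",    # NVIDIA H100 instance
              instance_count=2,
              hyperparameters={"epochs": 1},
              sagemaker_session=session,
          )

          # Launches managed, pay-as-you-go infrastructure; data is pulled
          # from S3 and the cluster is torn down when the job finishes.
          estimator.fit({"train": "s3://my-bucket/train/"})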
  15. Model deployment on Amazon SageMaker AI
      • Real-time synchronous response: invoke → response
      • Near real-time asynchronous response: invoke → response
      • Offline batch inference: submit → complete
      • Deployment options: single model deployment (single container or multi-container), multi-model deployment, multi-LoRA adapter hosting, serverless; on GPUs or CPUs
      • Stack: model, container, infrastructure
  16. Model Training with NVIDIA NeMo on SageMaker HyperPod
  17. NVIDIA NeMo framework on Amazon SageMaker HyperPod clusters
      This part of the guidance demonstrates how to deploy SageMaker HyperPod clusters based on HPC (Slurm).
      Architecture: admin/DevOps engineers and data scientists/ML engineers reach the cluster over SSH via SSM; the HyperPod VPC (service account) contains the controller node and HyperPod compute nodes connected by Elastic Fabric Adapter, with Amazon FSx for Lustre storage and an S3 bucket, and VPC peering to the customer VPC; AWS IAM Identity Center, Amazon Managed Service for Prometheus, and Amazon Managed Grafana round out access and observability.
      1. The account team reserves compute capacity with On-Demand Capacity Reservations (ODCR) or Amazon SageMaker HyperPod Flexible Training Plans.
      2. Admin/DevOps engineers use the SageMaker HyperPod VPC stack to deploy networking, storage, and Identity and Access Management (IAM) resources.
      3. Admin/DevOps engineers push lifecycle scripts to the S3 bucket created in the previous step.
      4. Admin/DevOps engineers use the AWS CLI to create the SageMaker HyperPod cluster (a boto3 sketch of this call follows after this list).
      5. Admin/DevOps engineers generate key pairs to establish access to the controller node of the HyperPod cluster. Once the cluster is created, admins can test SSH access to the controller and compute nodes and examine the cluster.
      6. Admin/DevOps engineers configure IAM to use Amazon Managed Service for Prometheus to collect cluster metrics and Amazon Managed Grafana to set up the observability stack.
      7. Admin/DevOps engineers can make further changes to the cluster using the HyperPod CLI.
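      A minimal sketch of step 4 using boto3 instead of the AWS CLI. The cluster name, instance types, counts, role ARN, and S3 lifecycle-script location are illustrative assumptions:

          import boto3

          sm = boto3.client("sagemaker")

          response = sm.create_cluster(
              ClusterName="nemo-hyperpod-cluster",        # hypothetical name
              InstanceGroups=[
                  {
                      "InstanceGroupName": "controller",
                      "InstanceType": "ml.m5.xlarge",
                      "InstanceCount": 1,
                      "LifeCycleConfig": {
                          "SourceS3Uri": "s3://my-bucket/lifecycle-scripts/",
                          "OnCreate": "on_create.sh",     # pushed in step 3
                      },
                      "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodRole",
                  },
                  {
                      "InstanceGroupName": "compute",
                      "InstanceType": "ml.p5.48xlarge",   # NVIDIA H100 GPUs
                      "InstanceCount": 2,
                      "LifeCycleConfig": {
                          "SourceS3Uri": "s3://my-bucket/lifecycle-scripts/",
                          "OnCreate": "on_create.sh",
                      },
                      "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodRole",
                  },
              ],
          )
          print(response["ClusterArn"])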
  18. Demo Video: Setting up NVIDIA NeMo on HyperPod easily
  19. Model Deployment with NVIDIA NIM on SageMaker AI
  20. NVIDIA NIM
  21. NVIDIA NIM public ECR gallery on AWS: https://gallery.ecr.aws/nvidia/nim
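      The deployment code on the next slide assumes that nim_image, role, and NGC_API_KEY are already defined. A minimal sketch of that setup; the image URI below is an illustrative placeholder (check the gallery above for actual repositories and tags), and SageMaker hosting generally expects the image to be available in an Amazon ECR repository in your region:

          import os
          import sagemaker

          role = sagemaker.get_execution_role()    # SageMaker execution role ARN
          NGC_API_KEY = os.environ["NGC_API_KEY"]  # NGC key the container uses to fetch weights

          # Illustrative placeholder: a NIM container copied into your own
          # ECR repository in the endpoint's region.
          nim_image = "<account-id>.dkr.ecr.<region>.amazonaws.com/nim/mixtral-8x7b-instruct:latest"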
  22. Deploy with NIM on SageMaker (Mixtral 8x7B)

      import boto3

      sm = boto3.client("sagemaker")

      sm_model_name = "nim-mixtral-8x7b-instruct"
      instance_type = "ml.p4d.24xlarge"

      container = {
          "Image": nim_image,
          "Environment": {"NGC_API_KEY": NGC_API_KEY},
      }

      create_model_response = sm.create_model(
          ModelName=sm_model_name,
          ExecutionRoleArn=role,
          PrimaryContainer=container,
      )

      create_endpoint_config_response = sm.create_endpoint_config(
          EndpointConfigName=sm_model_name,
          ProductionVariants=[
              {
                  "InstanceType": instance_type,
                  "InitialVariantWeight": 1,
                  "InitialInstanceCount": 1,
                  "ModelName": sm_model_name,
                  "VariantName": "AllTraffic",
                  "ContainerStartupHealthCheckTimeoutInSeconds": 850,
              }
          ],
      )

      create_endpoint_response = sm.create_endpoint(
          EndpointName=sm_model_name,
          EndpointConfigName=sm_model_name,
      )
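      Endpoint creation is asynchronous, so before invoking the endpoint you typically wait for it to reach InService. A minimal sketch using a boto3 waiter with the sm client from the slide above; the polling values are illustrative:

          # NIM containers can take several minutes to download model weights.
          waiter = sm.get_waiter("endpoint_in_service")
          waiter.wait(
              EndpointName=sm_model_name,
              WaiterConfig={"Delay": 30, "MaxAttempts": 60},  # poll every 30 s, up to ~30 min
          )
          print(sm.describe_endpoint(EndpointName=sm_model_name)["EndpointStatus"])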
  23. Inference with NIM on SageMaker (Mixtral 8x7B)

      import json

      import boto3

      client = boto3.client("sagemaker-runtime")

      payload_model = "mistralai/mixtral-8x7b-instruct-v0.1"
      messages = [
          {"role": "user", "content": "Hello! How are you?"},
          {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
          {"role": "user", "content": "Explain to me in detail what llm serving frameworks are"},
      ]
      payload = {
          "model": payload_model,
          "messages": messages,
          "max_tokens": 1024,
          "stream": True,
      }

      response = client.invoke_endpoint_with_response_stream(
          EndpointName=sm_model_name,
          Body=json.dumps(payload),
          ContentType="application/json",
          Accept="application/jsonlines",
      )
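      The slide stops at the invoke call; the response body is an event stream that still has to be consumed. A minimal sketch of reading it, assuming the NIM endpoint emits OpenAI-style JSON chunks over application/jsonlines (the exact framing is an assumption, so treat the parsing as illustrative):

          # Each event carries a PayloadPart with raw bytes; chunks can split
          # across JSON lines, so buffer until a full line is available.
          buffer = b""
          for event in response["Body"]:
              buffer += event["PayloadPart"]["Bytes"]
              while b"\n" in buffer:
                  line, buffer = buffer.split(b"\n", 1)
                  line = line.strip()
                  if not line or line == b"data: [DONE]":
                      continue
                  if line.startswith(b"data:"):
                      line = line[5:].strip()
                  chunk = json.loads(line)
                  # OpenAI-style streaming chunk: print the incremental text
                  delta = chunk["choices"][0].get("delta", {})
                  print(delta.get("content") or "", end="", flush=True)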
  24. References
      • NeMo on SageMaker HyperPod: https://aws.amazon.com/blogs/machine-learning/running-nvidia-nemo-2-0-framework-on-amazon-sagemaker-hyperpod/
      • NIM: NVIDIA AI Enterprise (AWS Marketplace): https://aws.amazon.com/marketplace/pp/prodview-ozgjkov6vq3l6
      • Mixtral 8x7B notebook: https://github.com/aws-samples/mistral-on-aws/blob/main/notebooks/NIM-inference-samples/mixtral_8x7b_Nvidia_nim.ipynb
  25. Thank you!
      © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Amazon Confidential and Trademark.