AWS Trainium/Inferentia/Neuron SDK: Latest Developments, Spring 2025

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Neuron Community - Day One
AWS Trainium/Inferentia/Neuron SDK: Latest Developments, Spring 2025
Hiroshi Tokoyo (常世大史), Annapurna Labs, ML SA
2025/04/09

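Self-introduction
• Name: Hiroshi Tokoyo (常世大史)
• Affiliation: Annapurna Labs
• Role: Sales promotion and technical support for technologies originating from Annapurna Labs
• Background: After working at a foreign-owned semiconductor company, joined Annapurna Labs in 2013 and became part of AWS through the 2015 acquisition
• Favorite AWS service: EC2 instances powered by AWS-designed chips

What is Annapurna Labs?
The semiconductor development division within AWS. It develops the AWS Nitro System, AWS Graviton processors, and the AWS Trainium and Inferentia ML accelerator chips.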

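History of AWS-designed AI accelerator chips
• 2019: Inf1 instances. High-performance, low-cost deep learning inference (inference-only instances)
• 2022: Trn1 instances. Cost-efficient, high-performance training of LLMs and image generation models (training-oriented instances)
• 2023: Inf2 instances. High-performance, low-cost inference for LLMs and image generation models (inference-oriented instances)
• 2024: Trn2 instances. The highest-performance EC2 instances, optimized for deep learning and generative AI (training-oriented instances)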

Customers using AWS Trainium and Inferentia
Anyscale

Amazon EC2 Trn1/Trn1n instances
Instances powered by AWS Trainium, the high-performance ML training accelerator designed by AWS

• Up to 16 AWS Trainium accelerators
• Up to 50% lower price than comparable GPU instances
• 512 GB of high-speed HBM2 memory
• Up to 1600 Gbps of EFA network bandwidth
• Models trained on Trn1 can be deployed on any target
• More than 30,000 AWS Trainium accelerators deployed in EC2 UltraClusters, providing 6 exaflops of compute performance

Instance size   Trainium accelerators  Accelerator memory  vCPU  Memory  NeuronLink  Network bandwidth  On-demand price (USD/hour)  Capacity Blocks price (USD/hour)
Trn1.2xlarge    1                      32 GB               8     32 GB   N/A         Up to 10 Gbps      1.34                        N/A
Trn1.32xlarge   16                     512 GB              128   512 GB  Yes         800 Gbps           21.5                        9.532
Trn1n.32xlarge  16                     512 GB              128   512 GB  Yes         1600 Gbps          24.78                       N/A

* Prices for US East (N. Virginia) as of April 2025
https://aws.amazon.com/jp/ec2/instance-types/trn1
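The price table can be turned into per-accelerator figures with a few lines of arithmetic. The following is a minimal sketch that only uses the hourly prices listed on this slide; the dictionary layout and the function names are hypothetical and chosen for illustration.

```python
# Hourly prices copied from the Trn1/Trn1n table (USD, US East (N. Virginia), April 2025).
# The structure and names below are illustrative assumptions, not an AWS API.
PRICES = {
    "Trn1.2xlarge": {"accelerators": 1, "on_demand": 1.34, "capacity_block": None},
    "Trn1.32xlarge": {"accelerators": 16, "on_demand": 21.5, "capacity_block": 9.532},
    "Trn1n.32xlarge": {"accelerators": 16, "on_demand": 24.78, "capacity_block": None},
}


def per_accelerator_hour(instance: str, pricing: str = "on_demand") -> float:
    """Return the hourly price divided by the number of Trainium accelerators."""
    entry = PRICES[instance]
    price = entry[pricing]
    if price is None:
        raise ValueError(f"{instance} has no {pricing} price listed")
    return price / entry["accelerators"]


def capacity_block_discount(instance: str) -> float:
    """Return the fractional saving of the Capacity Blocks price versus on-demand."""
    entry = PRICES[instance]
    if entry["capacity_block"] is None:
        raise ValueError(f"{instance} has no Capacity Blocks price listed")
    return 1.0 - entry["capacity_block"] / entry["on_demand"]


if __name__ == "__main__":
    for name in PRICES:
        print(name, round(per_accelerator_hour(name), 3), "USD per accelerator-hour")
    print("Trn1.32xlarge Capacity Blocks saving:",
          f"{capacity_block_discount('Trn1.32xlarge'):.1%}")
```

With the listed prices, the on-demand cost per accelerator of Trn1.32xlarge comes out very close to the Trn1.2xlarge price, so the larger size mainly adds NeuronLink and network bandwidth rather than a per-chip discount.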

Amazon EC2 Inf2 instances
The most cost-efficient inference instances for generative AI models

• Up to 12 AWS Inferentia2 chips, the second-generation ML inference chip
• Up to 4x higher throughput and up to 10x lower latency than Inf1
• 384 GB of high-speed HBM2 memory
• Large language models (LLMs) can be deployed on a single server
• Also supports training of small models

Instance size  Inferentia2 accelerators  Accelerator memory  vCPU  Memory  NeuronLink  Network bandwidth  On-demand price (USD/hour)
Inf2.xlarge    1                         32 GB               4     16 GB   N/A         Up to 15 Gbps      0.76
Inf2.8xlarge   1                         32 GB               32    128 GB  N/A         Up to 25 Gbps      1.97
Inf2.24xlarge  6                         192 GB              96    384 GB  Yes         50 Gbps            6.49
Inf2.48xlarge  12                        384 GB              192   768 GB  Yes         100 Gbps           12.98

* Prices for US East (N. Virginia) as of April 2025
https://aws.amazon.com/jp/ec2/instance-types/inf2/

AWS Trainium

Second-generation Neuron core (NeuronCore-v2)
• Tensor engine: optimized for matrix operations such as convolutions
• Scalar engine: optimized for activation functions such as ReLU
• Vector engine: optimized for Batch Normalization and pooling
• General-purpose SIMD (GPSIMD) engine: supports custom operators

32 GB HBM memory stacks

NeuronLink-v2
• High-bandwidth, low-latency device-to-device interconnect

Dedicated Collective Compute engine
• Overlaps compute with collective communication during distributed training and inference

[Block diagram: Trainium chip with two NeuronCore-v2 (each with Tensor, Vector, Scalar, and GPSIMD engines and on-chip SRAM memory), HBM, DMA engines, Collective Communication, host PCIe, and NeuronLink-v2 links]
https://aws.amazon.com/machine-learning/trainium/

AWS Trainium
[Block diagram: Trainium chip with two NeuronCore-v2, HBM, DMA engines, Collective Communication, host PCIe, and NeuronLink-v2 links]
2D torus topology (trn1.32xlarge)

AWS Trainium and Inferentia2
[Block diagrams: Trainium and Inferentia2 chips, each with two NeuronCore-v2, HBM, DMA engines, Collective Communication, host PCIe, and NeuronLink-v2 links]
• Trainium: 2D torus topology (trn1.32xlarge)
• Inferentia2: ring topology (inf2.48xlarge)

AWS Trainium2
• The third generation of generative AI / ML accelerators developed in-house by AWS
• Eight third-generation Neuron cores (NeuronCore-v3)
[Chart labels: HBM capacity, HBM bandwidth, dense compute, sparse compute]

Amazon EC2 Trn2 instances
• Amazon EC2 Trn2 instances powered by AWS Trainium2 are now generally available
• 30-40% better price performance than P5e/P5en
• Offered through Capacity Blocks for ML in the US East (Ohio) Region

• HIGH PERFORMANCE training and inference of trillion+ parameter generative AI models
• BEST PRICE-PERF for generative AI and deep learning on AWS
• UP TO 46 TB/s of HBM bandwidth, ideal for memory-intensive token generation

Instance size           trn2.48xlarge
Trainium2 chips         16
Chip memory             1.5 TB
Chip memory bandwidth   46 TB/s
vCPUs                   192
Instance memory         2 TB
Storage                 4x 1.92 TB NVMe
NeuronLink              1 TB/s
EFAv3                   3.2 Tb/s
Capacity Block price    $35.76/hr
3Yr RI price            $34.39/hr

* Prices for US East (Ohio) as of April 2025

AWS Trainium & Trainium2
[Block diagrams: Trainium (two NeuronCore-v2, two HBM stacks, NeuronLink-v2) next to Trainium2 (four NeuronCore-v3 blocks, four HBM stacks, NeuronLink-v3); both include DMA engines, Collective Communication, and host PCIe]

AWS Trainium2

Third-generation Neuron core (NeuronCore-v3)
- 8 Neuron cores
- Usable as 4 logical Neuron cores
- 128 Neuron cores per instance (64 logical Neuron cores)
- Sparsity support

96 GB (4 x 24 GB) HBM memory stacks
- 3x the capacity and 4x the bandwidth of Trainium

NeuronLink-v3
- Higher-bandwidth, lower-latency device-to-device interconnect
- Same 2D torus topology as Trainium

[Block diagram: Trainium2 chip with four NeuronCore-v3 blocks (each with Tensor, Vector, Scalar, and GPSIMD engines and on-chip SRAM memory), four HBM stacks, DMA engines, Collective Communication, host PCIe, and NeuronLink-v3 links]
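The per-instance figures follow directly from the per-chip numbers. The sketch below reproduces that arithmetic, assuming 16 Trainium2 chips per trn2.48xlarge as stated on the Trn2 instance slide; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trainium2Chip:
    """Per-chip figures taken from the slide; the class itself is illustrative."""
    physical_cores: int = 8     # NeuronCore-v3 per chip
    logical_cores: int = 4      # logical Neuron cores per chip
    hbm_stacks: int = 4
    hbm_gb_per_stack: int = 24

    @property
    def hbm_gb(self) -> int:
        return self.hbm_stacks * self.hbm_gb_per_stack


def instance_totals(chips_per_instance: int,
                    chip: Trainium2Chip = Trainium2Chip()) -> dict:
    """Scale the per-chip figures to a whole instance."""
    return {
        "physical_cores": chips_per_instance * chip.physical_cores,
        "logical_cores": chips_per_instance * chip.logical_cores,
        "hbm_gb": chips_per_instance * chip.hbm_gb,
    }


if __name__ == "__main__":
    # 16 chips per trn2.48xlarge (from the Trn2 instance slide).
    print(instance_totals(chips_per_instance=16))
    # {'physical_cores': 128, 'logical_cores': 64, 'hbm_gb': 1536}
```

The 1536 GB total matches the 1.5 TB of chip memory listed for trn2.48xlarge, and the 64 logical cores are the unit that parallelism degrees are expressed in.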

Neuron SDK software stack
[Stack diagram; labels grouped by layer]
• Frameworks and libraries: NxD Training, NxD Inference, PyTorch, PyTorch/XLA, PJRT, JAX, AXLearn
• Neuron Compiler: NKI, Kernels, Ops Fusion, Compute and Memory Optimizations
• Neuron Runtime: Libnrt, Neuron Driver, Libccom, Libfabric, EFA Driver
• Developer Tools: Health Monitoring & Observability, Neuron Profiler

NxD (NeuronX Distributed) Training: a library for distributed LLM training
• A distributed training library developed by AWS for Trn1/Trn2 instances
• Supports the techniques commonly used in distributed training
  - Tensor/Pipeline/Data parallelism, ZeRO-1 optimizer, Activation Memory Reduction, Sequence Parallelism
  - Support for various data types, async checkpoint saving
  - NKI (Neuron Kernel Interface)
• Training scripts can be accessed at different levels of abstraction
  - High-level YAML configuration files (similar to those of NVIDIA NeMo)
  - PyTorch Lightning (PTL) API
  - NxD Core foundational API
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/index.html

Llama3-70B continual pre-training
• Trn2.48xlarge provides 64 logical Neuron cores
• Tensor parallel degree: 32
• Pipeline parallel degree: 4
• Checkpoint saving method
https://github.com/aws-neuron/neuronx-distributed-training/
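The parallelism degrees on this slide determine how many logical Neuron cores one model replica occupies. The sketch below works through that arithmetic under two assumptions that are not stated in the deck: one worker per logical Neuron core, and the usual relation data parallel degree = world size / (tensor parallel x pipeline parallel). The function name is hypothetical and this is not the NxD Training API.

```python
def parallel_layout(num_instances: int,
                    logical_cores_per_instance: int = 64,
                    tensor_parallel: int = 32,
                    pipeline_parallel: int = 4) -> dict:
    """Derive the data-parallel degree from the slide's TP/PP settings.

    Assumes one worker per logical Neuron core (an assumption for this sketch).
    """
    world_size = num_instances * logical_cores_per_instance
    cores_per_replica = tensor_parallel * pipeline_parallel
    if world_size % cores_per_replica != 0:
        raise ValueError(
            f"world size {world_size} is not a multiple of "
            f"TP x PP = {cores_per_replica}"
        )
    return {
        "world_size": world_size,
        "cores_per_replica": cores_per_replica,
        "instances_per_replica": cores_per_replica / logical_cores_per_instance,
        "data_parallel": world_size // cores_per_replica,
    }


if __name__ == "__main__":
    # TP 32 x PP 4 = 128 logical cores, i.e. two trn2.48xlarge per model replica.
    print(parallel_layout(num_instances=8))
```

Under these assumptions a single instance (64 logical cores) cannot hold one replica with TP 32 and PP 4, so the function raises `ValueError` for `num_instances=1`; eight instances give a data-parallel degree of 4.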

NxD (NeuronX Distributed) Inference: a library for distributed LLM inference
• An inference library for deploying LLMs on Inferentia and Trainium instances
• Supports Llama 2, Llama 3.x (including multimodal), and Mixture-of-Experts (MoE) model architectures such as Mixtral and DBRX
• Supports the techniques commonly used in distributed inference
  - KV Cache, Multi-Head Attention (MHA), Grouped Query Attention (GQA), Flash Attention
  - vLLM support
  - Quantization and speculative decoding
  - NKI (Neuron Kernel Interface)
• Llama 3.1 405B can be deployed on a single trn2.48xlarge node. Multi-node distribution across trn1.32xlarge instances is also supported
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/index.html
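A rough memory estimate shows why Llama 3.1 405B fits on one trn2.48xlarge but needs several trn1.32xlarge nodes. The sketch below counts model weights only and assumes 2 bytes per parameter (BF16); it ignores the KV cache, activations, and runtime overhead, so the results are lower bounds. The accelerator memory figures (1536 GB and 512 GB per node) come from the instance slides in this deck; the function names are hypothetical.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_nodes_for_weights(num_params: float,
                          accel_memory_gb_per_node: float,
                          bytes_per_param: int = 2) -> int:
    """Lower bound on node count: weights only, no KV cache or activations."""
    needed = weight_memory_gb(num_params, bytes_per_param)
    return math.ceil(needed / accel_memory_gb_per_node)


if __name__ == "__main__":
    llama_405b = 405e9  # parameter count implied by the model name
    print(weight_memory_gb(llama_405b))             # 810.0
    print(min_nodes_for_weights(llama_405b, 1536))  # 1  (trn2.48xlarge: 16 x 96 GB)
    print(min_nodes_for_weights(llama_405b, 512))   # 2  (trn1.32xlarge: 512 GB)
```

Real deployments need headroom beyond this bound, which is one reason quantization appears in the feature list above.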

Llama-3.2-90B-Vision-Instruct on Trainium
• Serving the Llama 3.2-Vision 90B model on trn1.32xlarge using vLLM
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/llama3.2-multimodal-tutorial.html

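Neuron Compiler
[Diagram: compiler stages]
• Graph optimization (hardware-independent)
• Loop optimization (layout, tiling, vectorization, etc.)
• Mapping to hardware-specific instructions: z = matmul_128x128(x,y)
• Scheduling and allocation optimization (minimizing the working set, hiding latency)
[Diagram annotations: hardware-independent optimizations; hardware-dependent optimizations; NKI is implemented here]

• There are multiple IR (intermediate representation) layers
• The front end performs optimizations that do not depend on specific hardware
• NKI is implemented directly in the backend as an optimized intermediate IR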
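To make the tiling step concrete, here is a minimal pure-Python sketch of a matrix multiply split into fixed-size tiles. The default tile size of 128 is taken from the `matmul_128x128` label on the slide; the function name is hypothetical, and this is only an illustration of tiling, not how the Neuron compiler is implemented.

```python
def matmul_tiled(x, y, tile=128):
    """Multiply matrices (lists of rows) by iterating over tile-sized blocks.

    Each (i0, j0, k0) iteration is one tile-sized multiply-accumulate,
    the unit that would map onto a fixed-size hardware matmul instruction.
    """
    m, k = len(x), len(x[0])
    k2, n = len(y), len(y[0])
    if k != k2:
        raise ValueError("inner dimensions do not match")
    z = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # Accumulate the partial product of one tile pair into z.
                for i in range(i0, min(i0 + tile, m)):
                    x_row, z_row = x[i], z[i]
                    for kk in range(k0, min(k0 + tile, k)):
                        a, y_row = x_row[kk], y[kk]
                        for j in range(j0, min(j0 + tile, n)):
                            z_row[j] += a * y_row[j]
    return z


if __name__ == "__main__":
    x = [[1, 2, 3], [4, 5, 6]]
    y = [[7, 8], [9, 10], [11, 12]]
    print(matmul_tiled(x, y, tile=2))  # [[58.0, 64.0], [139.0, 154.0]]
```

Choosing the tile so that each block fits in on-chip memory is what keeps the working set small, which is the goal named in the scheduling and allocation stage.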

Neuron Kernel Interface (NKI)
• A bare-metal language and compiler for programming Neuron devices directly
• Python-based programming
• Familiar NumPy-style implementation
• Tile-based programming model
• Usable from PyTorch and JAX, or as a bare-metal implementation with NumPy
• Example: 2.5x performance improvement for Flash Attention

NKI Documentation - https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/
NKI-Samples on GitHub - https://github.com/aws-neuron/nki-samples
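The idea behind a tile-based Flash Attention kernel can be sketched without any NKI calls. The example below is plain Python, not NKI code: it computes attention for a single query by streaming over key/value tiles and keeping a running maximum and running sum (the online softmax trick), and compares the result with a straightforward implementation. All function names and the tile size are illustrative assumptions.

```python
import math


def _scores(q, keys):
    """Scaled dot-product scores of one query against a list of keys."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]


def attention_naive(q, keys, values):
    """Reference: softmax over all scores at once, then a weighted sum."""
    scores = _scores(q, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_tiled(q, keys, values, tile=2):
    """Same result, but only one tile of keys/values is touched at a time."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        scores = _scores(q, keys[start:start + tile])
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum
        # (exp(-inf) is 0.0, so the first tile starts from zero).
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [0.5, -1.0, 2.0]
    keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5],
            [-1.0, 0.5, 2.0], [0.2, 0.2, 0.2]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(attention_naive(q, keys, values))
    print(attention_tiled(q, keys, values, tile=2))
```

Because the full score vector is never materialized, the working set per step is bounded by the tile size, which is the property a tile-based kernel exploits to stay in on-chip memory.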

Neuron 2.22
• Neuron 2.22 was released on April 3. In addition to performance optimizations in the Neuron compiler, the libraries, and the driver, the main functional updates are as follows.
• Inference workloads
  - Support for the Llama-3.2-11B multimodal model
  - Support for multi-LoRA serving
  - Extended quantization features
• Training workloads
  - Support for LoRA supervised fine-tuning
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html

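Neuron 2.22: supported models
• Officially supported models
  - Models for which the AWS Neuron development team has completed performance optimization, QA testing, and similar work, and for which tutorials or sample scripts are provided
  - https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/models/index.html
• Models verified to work
  - Models such as Qwen, Gemma, Llava, and Whisper that are not officially supported by Neuron but have been verified to work by teams within AWS other than the Neuron development team
  - Please contact us about individual models.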

Amazon EC2 Trn2 UltraServers now in preview
• Connects the equivalent of four Trn2 instances, 64 Trainium2 chips in total, over high-bandwidth, low-latency NeuronLink-v3
• Achieves the highest performance of any EC2 server for ML
[Chart labels: sparse compute, dense compute, HBM bandwidth, NeuronLink bandwidth, EFAv3 bandwidth, HBM capacity]

AWS Trainium3 development announced
• Trainium3 announced to arrive within 2025
• The first AWS chip built on a 3 nm process
• 2x the performance of Trainium2
• 40% better power efficiency

Build on Trainium call for proposals - Spring 2025
• An AI research and education support program for universities
• Grants worth a total of USD 110 million
• Provides access to an AWS Trainium research cluster
  - Up to 40,000 Trainium chips available
• Current application period
  - March 19 to April 30, 2025
Details: https://aws.amazon.com/ai/machine-learning/trainium/research/

The AWS Neuron Community has launched!
Join the community here.

Hiroshi Tokoyo
