KubeCon NA 2025: DRA Recap and Challenges on Our In-House GPU Platform

CyberAgent ML Platform Team, Yohei Namba (南波陽平)

Self-introduction
Yohei Kevin Namba (南波陽平)
• Joined CyberAgent as a new graduate in 2024
  ◦ CyberAgent group Infrastructure Unit
• Member of the ML Platform Team
  ◦ Operating GPU servers and Kubernetes
  ◦ Developing the serverless inference and job platforms
  ◦ Developing the API/SDK/CLI for the private cloud
• Accounts
  ◦ X: kevin_namba
  ◦ LinkedIn/GitHub: kevin-namba

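KubeCon NA 2025: impressions
• DRA reached GA in Kubernetes 1.34
  ◦ There were many DRA-related sessions
• KServe became an Incubating Project
  ◦ We use it in our inference platform, so we are following its direction closely
    ▪ llm-d and Envoy AI Gateway integration, etc.
• AI × GitOps, AI × Security, and more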

KubeCon NA 2025: impressions
• Many of the challenges on our GPU platform look solvable with DRA
  ◦ This talk recaps four sessions
    ▪ DRA is GA! Kubernetes WG Device Management - GPUs, TPUs, NICs and More With DRA - Kevin Klues, NVIDIA & Patrick Ohly, Intel
    ▪ Achieving Peak Performance Through Hardware Alignment in DRA - Gaurav Ghildiyal, Google & Byonggon Chun, Fluidstack
    ▪ Partitionable Devices: Putting the “Dynamic” Back in Dynamic Resource Allocation - Morten Jæger Torkildsen, Google & Jan-Philip Gehrcke, NVIDIA
    ▪ Share With Care: Efficient Device Sharing With Guaranteed Resources Using DRA - Sunyanan Choochotkaew, IBM Research & John Belamaric, Google

DRA Overview

Dynamic Resource Allocation (DRA) in Four Parts
Part 1: New Kubernetes API to describe devices (ResourceSlice): This device is an nvidia.com/gpu, its product ID is A100-SXM4-40GB, it has 40Gi of memory, and 3456 FP64 cores.
Part 2: New Kubernetes API to request devices (ResourceClaim): I need an nvidia.com/gpu with at least 30Gi of memory and 3000 FP64 cores.
Part 3: Updated scheduler to match requests to devices.
Part 4: New Kubelet API to actuate the scheduler’s decisions.
Source: DRA is GA! Kubernetes WG Device Management - GPUs, TPUs, NICs and More With DRA - Kevin Klues, NVIDIA & Patrick Ohly, Intel
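The matching step in Part 3 can be pictured with a small sketch. This is not the Kubernetes scheduler's implementation: the `Device` and `Request` classes, their field names, and the `small-gpu` entry are hypothetical stand-ins for a ResourceSlice entry and a ResourceClaim request, using the numbers from the slide.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical stand-in for one device published in a ResourceSlice."""
    name: str
    kind: str
    memory_gi: int
    fp64_cores: int


@dataclass
class Request:
    """Hypothetical stand-in for one device request in a ResourceClaim."""
    kind: str
    min_memory_gi: int
    min_fp64_cores: int


def matches(device: Device, request: Request) -> bool:
    # A device satisfies a request when its kind matches and every
    # requested minimum is met.
    return (
        device.kind == request.kind
        and device.memory_gi >= request.min_memory_gi
        and device.fp64_cores >= request.min_fp64_cores
    )


def allocate(devices, request, allocated):
    # Return the first free device that satisfies the request, or None.
    for device in devices:
        if device.name not in allocated and matches(device, request):
            allocated.add(device.name)
            return device
    return None


devices = [
    Device(name="small-gpu", kind="nvidia.com/gpu", memory_gi=16, fp64_cores=1000),
    Device(name="a100", kind="nvidia.com/gpu", memory_gi=40, fp64_cores=3456),
]
request = Request(kind="nvidia.com/gpu", min_memory_gi=30, min_fp64_cores=3000)

allocated = set()
first = allocate(devices, request, allocated)
second = allocate(devices, request, allocated)
print(first.name if first else None)
print(second.name if second else None)
```

The first call picks the 40Gi device; the second finds nothing left that satisfies the request, which is the situation in which a Pod would stay pending.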

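What was added in 1.34
• Extended Resource Requests via DRA
  ◦ Each node uses either the Device Plugin or the DRA Driver
  ◦ Enables a gradual migration to DRA (can coexist with the Device Plugin)
• Binding Conditions
  ◦ Delays scheduling (binding) of the Pod until the DRA Driver has completed the allocation
  ◦ Useful for cases where a device has to be attached to the node
• Consumable Capacity
  ◦ A single device can be shared by, and allocated to, multiple Pods
  ◦ The device's capacity is consumed according to the amount each request asks for
Source: DRA is GA! Kubernetes WG Device Management - GPUs, TPUs, NICs and More With DRA - Kevin Klues, NVIDIA & Patrick Ohly, Intel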

What’s next in DRA
Source: DRA is GA! Kubernetes WG Device Management - GPUs, TPUs, NICs and More With DRA - Kevin Klues, NVIDIA & Patrick Ohly, Intel

DRA-related activity
• DRANET
  ◦ https://dranet.dev/
• DRA Driver for CPUs
  ◦ https://github.com/kubernetes-sigs/dra-driver-cpu
• KubeVirt Integration
  ◦ https://github.com/kubevirt/enhancements/issues/10
• Kueue Integration
  ◦ https://github.com/kubernetes-sigs/kueue/issues/2941
Source: DRA is GA! Kubernetes WG Device Management - GPUs, TPUs, NICs and More With DRA - Kevin Klues, NVIDIA & Patrick Ohly, Intel

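Challenges on our in-house GPU platform 1
Note: because the recap is the main focus, this talk does not discuss how severe each challenge is or how feasible a solution would be.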

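CyberAgent's GPU platform
• A wide variety of on-premises GPUs on Kubernetes
  ◦ L4/H100/A100, etc.
  ◦ User namespaces: 200+ (Kueue rollout in progress)
  → Depending on availability, demand for GPU partitioning and multi-node jobs is growing
• Managed services
  ◦ JupyterLab (Notebook) environment
  ◦ Inference platform (uses kserve/kserve)
  ◦ Training platform (uses kubeflow)
  ◦ Distributed training platform (uses kubeflow/mpi-operator)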

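CyberAgent's GPU platform
• Distributed training platform
  ◦ 400G full bisection (RoCEv2)
    ▪ Rail-optimized Topology
    ▪ Full Bisection Bandwidth
    ▪ Adaptive Routing
    ▪ An 800G environment is also being added
  ◦ NICs are attached to Pods with the SR-IOV Plugin + Multus
References (in Japanese):
• AI/ML基盤の400G DCネットワークを構築した話 (How we built a 400G DC network for our AI/ML platform)
• 大規模な分散機械学習を支える NVIDIA H100 Kubernetes クラスタとそのエコシステム (The NVIDIA H100 Kubernetes cluster and ecosystem behind large-scale distributed machine learning)
• AI/ML基盤における800GbE スイッチ導入とその挑戦 (Introducing 800GbE switches into the AI/ML platform and the challenges involved)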

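Challenges on our in-house GPU platform
• Communication performance within a single node
  ◦ The fleet is heterogeneous, so some nodes have no NVLink
  → Performance degrades depending on the HW alignment between GPUs and between GPU and CPU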

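Challenges on our in-house GPU platform
• Communication performance when multiple users run multi-node training
  ◦ The NIC-to-GPU mapping inside a node cannot be managed
  ◦ Depending on which nodes the Pods land on, a specific link becomes the bottleneck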

Recap: Achieving Peak Performance Through Hardware Alignment in DRA - Gaurav Ghildiyal, Google & Byonggon Chun, Fluidstack


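• Misaligned (worst case)
  ◦ A Pod grabs an arbitrary NIC and GPU
    ▪ They end up spanning different NUMA nodes or PCIe localities

• Aligned (ideal)
  ◦ Each Pod gets a GPU and a NIC in the same PCIe locality

• Scheduler
  ◦ May pick a node even when optimal co-location is not possible there
• Kubelet
  ◦ Assigns whatever resources are free, even when they are not optimal
→ Until now, there was no way to say "give me a GPU and a NIC in the same PCIe locality"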

k8s-dra-driver-gpu + DRANET
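With both drivers publishing their devices through DRA, a claim can require that the GPU and the NIC it receives share the same value for some attribute. The sketch below mimics that idea in plain Python. It is not the scheduler's algorithm, and the `pcie_root` field and all device names are hypothetical; the actual attribute names depend on what each driver publishes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    kind: str       # "gpu" or "nic"
    pcie_root: str  # hypothetical attribute: the PCIe root the device hangs off


def aligned_pairs(devices):
    """Pair each GPU with a NIC that shares its PCIe root.

    GPUs without a NIC on the same root are left unpaired rather than
    matched across localities.
    """
    by_root = defaultdict(lambda: {"gpu": [], "nic": []})
    for device in devices:
        by_root[device.pcie_root][device.kind].append(device)

    pairs = []
    for root in sorted(by_root):
        group = by_root[root]
        # zip stops at the shorter list, so nothing is paired across roots.
        pairs.extend(zip(group["gpu"], group["nic"]))
    return pairs


devices = [
    Device("gpu0", "gpu", "pci0"),
    Device("nic0", "nic", "pci0"),
    Device("gpu1", "gpu", "pci1"),
    Device("nic1", "nic", "pci1"),
    Device("gpu2", "gpu", "pci2"),  # no NIC on this root
]

pairs = aligned_pairs(devices)
for gpu, nic in pairs:
    print(gpu.name, nic.name, gpu.pcie_root)
```

In this sketch `gpu2` stays unpaired, which corresponds to a claim that cannot be satisfied on that locality instead of silently receiving a misaligned NIC.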

Evaluation (figure: Worst vs. Best)

Evaluation (figure: GPU-CPU Aligned vs. GPU-CPU Misaligned)

• Node Topology Constraints
  ◦ Pods can also be placed on nodes in a way that avoids inter-rack communication

Challenges on our in-house GPU platform 2

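• Inventory problem
  ◦ Not every training or inference workload needs a large amount of GPU memory
  ◦ Small GPUs are not always in stock
• MIG
  ◦ Partitions a GPU's compute units, memory, and so on at the HW level, isolating both QoS and faults
• Problems with MIG × Device Plugin
  ◦ The number of partitions cannot be changed in response to inventory
  ◦ Keeping MIG enabled all the time lowers efficiency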

Recap: Partitionable Devices: Putting the “Dynamic” Back in Dynamic Resource Allocation - Morten Jæger Torkildsen, Google & Jan-Philip Gehrcke, NVIDIA

• Using MIG with the Device Plugin
  ◦ Only full GPUs can be allocated
  ◦ The number of partitions has to be fixed, and each partition exposed to Kubernetes one by one
  → Changing the number of partitions dynamically, or switching MIG on and off, is difficult

• Partitionable Devices
  ◦ CounterSet
    ▪ The whole physical GPU (a resource pool)
  ◦ Counter
    ▪ A limited resource
      • Memory
      • Number of cores

• Device
  ◦ A logical device that consumes Counters (a MIG instance, for example)

• When a Device is allocated, it consumes Counters

• A Device that needs more than the remaining Counters cannot be scheduled
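The counter bookkeeping described on these slides can be sketched as follows. This is a simplified model rather than the Kubernetes API types: the counter names (`memory_gi`, `compute_slices`), the partition names, and the quantities are illustrative assumptions and do not correspond to real MIG profiles.

```python
from dataclasses import dataclass


@dataclass
class CounterSet:
    """Remaining counters of one physical GPU (the resource pool)."""
    counters: dict


@dataclass
class Device:
    """A logical device (for example a partition) and the counters it consumes."""
    name: str
    consumes: dict


def try_allocate(counter_set: CounterSet, device: Device) -> bool:
    # Refuse the allocation if any required counter would go below zero.
    for counter, amount in device.consumes.items():
        if counter_set.counters.get(counter, 0) < amount:
            return False
    # All counters suffice: consume them.
    for counter, amount in device.consumes.items():
        counter_set.counters[counter] -= amount
    return True


gpu = CounterSet(counters={"memory_gi": 40, "compute_slices": 7})
half_a = Device("half-a", {"memory_gi": 20, "compute_slices": 3})
half_b = Device("half-b", {"memory_gi": 20, "compute_slices": 3})
small = Device("small", {"memory_gi": 5, "compute_slices": 1})

results = [try_allocate(gpu, d) for d in (half_a, half_b, small)]
print(results)
print(gpu.counters)
```

The third request is rejected because the memory counter is exhausted, even though a compute slice is still free. That is the case in which a Device can no longer be scheduled.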

• NVIDIA DRA Driver (https://github.com/NVIDIA/k8s-dra-driver-gpu)
  ◦ A DRA driver for GPUs that NVIDIA is developing
    ▪ go-nvml
      • Enables/disables MIG and creates/deletes MIG instances
    ▪ Processing flow
      • Job start → enable MIG
        ◦ Create the required MIG instances
      • Job end → delete the MIG instances → return the GPU to a regular GPU
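The processing flow above (enable MIG and create instances at job start, tear everything down at job end) can be illustrated with a small simulation. Nothing here calls go-nvml or the real driver: `FakeGpu`, `mig_partitions`, and the profile strings are hypothetical names used only to show the ordering of the steps.

```python
from contextlib import contextmanager


class FakeGpu:
    """In-memory stand-in for a GPU; it only records state changes."""

    def __init__(self):
        self.mig_enabled = False
        self.instances = []
        self.log = []


@contextmanager
def mig_partitions(gpu: FakeGpu, profiles):
    # Job start: enable MIG, then create the instances the job needs.
    gpu.mig_enabled = True
    gpu.log.append("enable MIG")
    gpu.instances = list(profiles)
    gpu.log.append(f"create {len(profiles)} instance(s)")
    try:
        yield gpu.instances
    finally:
        # Job end: delete the instances and return to a regular GPU,
        # even if the job raised an exception.
        gpu.instances = []
        gpu.log.append("delete instances")
        gpu.mig_enabled = False
        gpu.log.append("disable MIG")


gpu = FakeGpu()
with mig_partitions(gpu, ["profile-a", "profile-b"]) as instances:
    during = (gpu.mig_enabled, list(instances))

print(during)
print(gpu.mig_enabled, gpu.instances)
print(gpu.log)
```

Using a context manager keeps the teardown in one place, so the GPU returns to its regular state whether the job succeeds or fails.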

• Using MIG with DRA
  ◦ Partitionable Devices (planned to reach beta in v1.36)
    ▪ Makes dynamic MIG partitioning possible
  ◦ DRA driver
    ▪ MIG is enabled only at allocation time

Challenges on our in-house GPU platform 3

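• Noisy neighbors on network bandwidth
  ◦ Training Jobs consume a lot of bandwidth while downloading data
  ◦ Inter-node traffic of (non-interconnect) multi-node training/inference
• Current countermeasures
  ◦ External traffic is limited via a Pod annotation
  ◦ Pod-to-Pod traffic, external traffic, and Kubernetes system traffic all share the same network
  → Risk that several bandwidth-hungry Pods get scheduled onto the same node (NIC)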

Recap: Share With Care: Efficient Device Sharing With Guaranteed Resources Using DRA

• Conventional network control
  ◦ Per-Pod bandwidth limiting is possible
  ◦ The Scheduler has no way of knowing about bandwidth

• Consumable Capacity
  ◦ Allows the same device to be allocated multiple times (only within its capacity)
  ◦ Prevents the same device from being allocated twice within a single Claim
  ◦ Prevents over-commitment and provides guaranteed allocations
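The three rules above can be sketched for a shared NIC whose capacity is bandwidth. This is a simplified model, not the Kubernetes implementation: the `SharedDevice` class, the Gbps unit, and the claim names are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SharedDevice:
    """A device whose capacity can be split across several claims."""
    name: str
    capacity_gbps: int
    allocations: dict = field(default_factory=dict)  # claim name -> Gbps

    @property
    def remaining_gbps(self) -> int:
        return self.capacity_gbps - sum(self.allocations.values())

    def allocate(self, claim: str, gbps: int) -> bool:
        # Rule 2: one claim may not hold the same device twice.
        if claim in self.allocations:
            return False
        # Rules 1 and 3: sharing is allowed, but only within the remaining
        # capacity, so every accepted allocation is guaranteed.
        if gbps > self.remaining_gbps:
            return False
        self.allocations[claim] = gbps
        return True


nic = SharedDevice(name="nic0", capacity_gbps=400)
outcomes = [
    nic.allocate("claim-a", 200),  # fits
    nic.allocate("claim-b", 150),  # fits, shares the device with claim-a
    nic.allocate("claim-a", 10),   # same claim again: rejected
    nic.allocate("claim-c", 100),  # exceeds the remaining 50: rejected
]
print(outcomes)
print(nic.remaining_gbps)
```

Because the scheduler sees the remaining capacity, a bandwidth-hungry Pod is steered to another device instead of being packed onto an already saturated one.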


Summary

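• DRA reached GA in 1.34
  ◦ Scheduling that takes HW alignment within a node into account becomes possible
  ◦ MIG can be used flexibly
  ◦ Scheduling that takes network capacity into account becomes possible
• Concerns about adoption are also being resolved
  ◦ Coexistence with the Device Plugin
  ◦ Kueue Integration
• We would very much like to adopt it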

Sessions referenced
• DRA is GA! Kubernetes WG Device Management - GPUs, TPUs, NICs and More With DRA - Kevin Klues, NVIDIA & Patrick Ohly, Intel
  ◦ https://sched.co/27Nu0
• Achieving Peak Performance Through Hardware Alignment in DRA - Gaurav Ghildiyal, Google & Byonggon Chun, Fluidstack
  ◦ https://sched.co/27Fds
• Partitionable Devices: Putting the “Dynamic” Back in Dynamic Resource Allocation - Morten Jæger Torkildsen, Google & Jan-Philip Gehrcke, NVIDIA
  ◦ https://sched.co/27FbY
• Share With Care: Efficient Device Sharing With Guaranteed Resources Using DRA - Sunyanan Choochotkaew, IBM Research & John Belamaric, Google
  ◦ https://sched.co/27FWB

Appendix

DRA-related sessions at KubeCon NA 2025 (1/2)
• Unlocking Performance: Topology-Aware CPU Scheduling With a DRA Driver - Praveen Krishna, Google & Marlow Warnicke (Weston), SchedMD LLC
  ◦ https://sched.co/28aDA
• Kubernetes for Multi-Host Training and Inference: Workload Aware Scheduling - Eric Tune & Dominik Marcinski, Google
  ◦ https://sched.co/28aDJ
• Fit-to-Serve: How a New DRA Capability for Dynamic Device Sharing Fits Into Distributed LLM Serving - Sunyanan Choochotkaew & Tatsuhiro Chiba, IBM Research
  ◦ https://sched.co/28D3U
• DRA is GA! Kubernetes WG Device Management - GPUs, TPUs, NICs and More With DRA - Kevin Klues, NVIDIA & Patrick Ohly, Intel
  ◦ https://sched.co/27Nu0
• Navigating the AI/ML Networking Maze in Kubernetes: Lessons From the Trenches - Antonio Ojea, Google
  ◦ https://sched.co/27FZW
• Share With Care: Efficient Device Sharing With Guaranteed Resources Using DRA - Sunyanan Choochotkaew, IBM Research & John Belamaric, Google
  ◦ https://sched.co/27FWB

DRA-related sessions at KubeCon NA 2025 (2/2)
• Sponsored Demo: The DRA Paradigm Shift: Request a Capability, Not a Node
  ◦ https://sched.co/2A7Dq
• 📚 Tutorial: Unlock the Future of Kubernetes and Accelerators With Dynamic Resource Allocation (DRA) - Rey Lejano, Red Hat
  ◦ https://sched.co/27FbG
• ⚡ Lightning Talk: Getting (and Staying) up To Speed on DRA With the DRA Example Driver - Jon Huhn, Microsoft
  ◦ https://sched.co/27Fbq
• Keynote: The Community-Driven Evolution of the Kubernetes Network Driver - Lionel Jouin, Software Engineer, Red Hat & Antonio Ojea, Staff Software Engineer, Google
  ◦ https://sched.co/27FYh
• Achieving Peak Performance Through Hardware Alignment in DRA - Gaurav Ghildiyal, Google & Byonggon Chun, Fluidstack
  ◦ https://sched.co/27Fds
• Partitionable Devices: Putting the “Dynamic” Back in Dynamic Resource Allocation - Morten Jæger Torkildsen, Google & Jan-Philip Gehrcke, NVIDIA
  ◦ https://sched.co/27FbY
