
DeepOps – An efficient way to deploy GPU cluster for computing

Frank Lin
November 30, 2021

HPC Summit | 2021 APAC
https://www.garaotus.com/HPC2021/vod/


Transcript

  1. DeepOps – An Efficient Way To Deploy GPU Cluster for Computing
     Frank Lin @yylin1, GARAOTUS
     HPC Summit 2021 | NOVEMBER 18, 2021
  2. About Me: 林義洋 Frank Lin
     • Co-organizer of Cloud Native Taiwan User Group
     • Interested in emerging technologies
     GitHub: yylin1 ([email protected])
     Blog: https://blog.yylin.io
  3. Agenda
     • Manage resources more effectively
     • What is DeepOps?
     • How to choose job schedule management?
  4. Why Clusters?
     [Diagram: CPU and GPU nodes tracked ad hoc (Excel, internal page) vs. pooled on a platform]
     Ref: GTC 2020 - AI as a Service (AIaaS): Zero to Kubeflow: Supporting Your Data Science Teams with the Most Common Use Cases
     https://developer.nvidia.com/gtc/2020/video/s22040-vid
  5. AI Platform Considerations
     Ref: GTC Silicon Valley 2019: Building and managing scalable AI infrastructure with NVIDIA DGX POD and DGX POD Management Software
     https://developer.nvidia.com/gtc/2019/video/s9334
  6. The power of one platform
     • Easy to deploy and run workflows from development through production
     • A single management platform can support multiple teams or projects
     • Effectively manage and monitor cluster resources (GPUs), billing, and usage
     • Let each team focus on what it does best and remove the extra overhead:
       • Data Scientists -> Data Science
       • App Developers -> Implement Apps
       • DevOps Engineers -> DevOps / AI Workflow
  7. What is DeepOps?
     DeepOps is an open-source project for rapidly deploying Kubernetes and Slurm. It is an ideal choice for large GPU server clusters as well as shared single-node deployments (e.g., NVIDIA DGX systems).
     • Highly flexible: it can be adapted or used in a modular way to match the cluster requirements of a specific use case
     • Uses Ansible to set up the entire cluster management stack end to end (a deployment sketch follows below)
     • Can deploy Slurm, Kubernetes, or a hybrid of both
     • Provides scripts for quickly deploying Kubeflow and attaching NFS storage on an existing cluster
     • GitHub: https://github.com/NVIDIA/deepops
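     As a rough illustration of that Ansible-driven flow, a minimal Python sketch for the provisioning node. The script and playbook paths (scripts/setup.sh, playbooks/k8s-cluster.yml, playbooks/slurm-cluster.yml) reflect the public DeepOps repository layout and should be verified against the release you check out.

        # Hedged sketch: driving a DeepOps deployment from the provisioning node.
        # Paths follow the public DeepOps repo layout; verify against your release.
        import subprocess

        def run(cmd, cwd=None):
            """Run a command, echo it, and stop on the first failure."""
            print("+", " ".join(cmd))
            subprocess.run(cmd, cwd=cwd, check=True)

        # 1. Fetch DeepOps and bootstrap Ansible plus its dependencies locally.
        run(["git", "clone", "https://github.com/NVIDIA/deepops.git"])
        run(["./scripts/setup.sh"], cwd="deepops")

        # 2. After editing config/inventory with your management, login, and GPU
        #    nodes, deploy Kubernetes, Slurm, or both over SSH (agentless).
        run(["ansible-playbook", "-l", "k8s-cluster", "playbooks/k8s-cluster.yml"],
            cwd="deepops")
        run(["ansible-playbook", "-l", "slurm-cluster", "playbooks/slurm-cluster.yml"],
            cwd="deepops")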
  8. Building out your GPU cluster
     Ref: GTC Silicon Valley 2019: Building and managing scalable AI infrastructure with NVIDIA DGX POD and DGX POD Management Software
     https://developer.nvidia.com/gtc/2019/video/s9334
  9. Installing & managing an AI as a Servicer cluster DeepOps

    Components Scheduler Ref: GTC Digital October 2020: GPU-Accelerated Labs on Demand Using Virtual Computer Serve r https://www.nvidia.com/en-us/on-demand/session/gtcfall20-a21371/
  10. Automation: Ansible
     • Open-source automation and configuration management tool
     • Agentless (nothing to install on target nodes)
     • Easier to maintain & scale than custom scripts
     • Playbooks use YAML: easy to learn and read (a small example of driving a playbook follows below)
     Ref: https://www.ansible.com/
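     A hedged sketch of driving such a playbook programmatically with the ansible-runner package; the playbook and inventory paths below are placeholders, not part of this deck.

        # Hedged sketch: running an Ansible playbook from Python with
        # ansible-runner (pip install ansible-runner). Paths are placeholders.
        import ansible_runner

        result = ansible_runner.run(
            private_data_dir=".",           # working directory for run artifacts
            playbook="site.yml",            # any YAML playbook, e.g. a DeepOps one
            inventory="config/inventory",   # hosts file listing the target nodes
        )
        print(result.status, result.rc)     # e.g. "successful", 0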
  11. Schedulers Comparison
     • Basic scheduling features (kube-batch)
     • Share nodes, schedule jobs for GPUs on a node (Excel spreadsheet)
     • Advanced container and workflow orchestration (e.g., with Kubeflow)
     • Covers data permissions and security (LDAP, file permissions)
     • Adds analytics and monitoring (important also for justification of purchase)
     • Advanced scheduling features (batch / gang scheduling)
     • Works also without containers (no runtime overhead)
     • Multi-node jobs
     • Job dependencies, workflows, DAGs (see the sketch after this list)
     • Advanced reservations
     • Intelligent scheduling (not just FIFO)
     • User accounting
     • Other HPC-like scheduling functionality
     Ref: GTC Silicon Valley 2019: Building and managing scalable AI infrastructure with NVIDIA DGX POD and DGX POD Management Software
     https://developer.nvidia.com/gtc/2019/video/s9334
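     To make the job-dependency item concrete, a minimal sketch of chaining two Slurm batch jobs so the second runs only after the first succeeds; the script names (preprocess.sbatch, train.sbatch) are hypothetical placeholders.

        # Hedged sketch: chaining two Slurm jobs with a dependency so training
        # starts only after preprocessing succeeds. Script names are placeholders.
        import subprocess

        def sbatch(*args):
            """Submit with sbatch --parsable and return the numeric job id."""
            out = subprocess.run(["sbatch", "--parsable", *args],
                                 check=True, capture_output=True, text=True)
            return out.stdout.strip().split(";")[0]

        prep_id = sbatch("preprocess.sbatch")
        train_id = sbatch(f"--dependency=afterok:{prep_id}", "train.sbatch")
        print(f"submitted {prep_id} -> {train_id}")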
  12. AI Workflow - Choosing the platform that fits
     Free data scientists and developers from platform setup and steep learning curves as quickly as possible, so teams spend more of their time actually using GPU resources.
     HPC (Slurm + Container)
     • Matches the developer's existing environment: low technical barrier, no difference in how code is developed
     • Batch job scheduling uses work queues to arbitrate resource allocation
     • Node compute resources run on bare metal or via a container runtime (e.g., Singularity, NVIDIA Enroot); a submission sketch follows below
     Application Containerization (Kubernetes)
     • High availability for long-running AI services
     • The platform offers richer module integration: tools that support the AI workflow and monitoring of jobs and GPU resources
     • Job scheduling must be tuned to the application scenario
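     A rough sketch of that HPC path: composing a batch script that requests GPUs and runs inside a Singularity container, then handing it to the Slurm queue. Partition name, GPU count, container image, and entrypoint are placeholder values.

        # Hedged sketch: a GPU batch job run through the Slurm queue inside a
        # Singularity container. Partition, GPU count, and image are placeholders.
        import subprocess, textwrap

        job_script = textwrap.dedent("""\
            #!/bin/bash
            #SBATCH --job-name=train
            #SBATCH --partition=gpu        # placeholder partition name
            #SBATCH --gres=gpu:2           # request 2 GPUs on one node
            #SBATCH --time=04:00:00
            # --nv exposes the host NVIDIA driver and GPUs inside the container
            singularity exec --nv pytorch.sif python train.py
            """)

        # sbatch reads the job script from stdin when no file argument is given.
        subprocess.run(["sbatch"], input=job_script, text=True, check=True)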
  13. Summary
     • DeepOps removes much of the tedium of deployment; it is an NVIDIA open-source project that helps accelerate GPU cluster build-out
     • Slurm offers more scheduling algorithms, letting HPC users and data scientists run their jobs more effectively
       • Needs a module load system to provision GPU environment dependencies
     • Kubernetes fits batch-system / AI application scenarios such as TensorFlow, Spark, PyTorch, and MPI
       • Needs an additional job scheduler (kube-batch, Volcano) to improve execution policies (a minimal sketch follows below)
     • Choose the platform based on your existing development environment; the one that suits developers best is the best choice
       • Mind the technical barriers (e.g., learning to write YAML, disk-mount issues)
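     As a closing illustration of the Kubernetes path, a hedged sketch of a one-GPU Job submitted with the official Kubernetes Python client. It assumes the NVIDIA device plugin exposes the nvidia.com/gpu resource and, for the scheduler_name line, that the Volcano scheduler mentioned above is installed; the image and entrypoint are placeholders.

        # Hedged sketch: a one-GPU Kubernetes Job via the official Python client
        # (pip install kubernetes). Assumes the NVIDIA device plugin is installed;
        # scheduler_name further assumes the Volcano scheduler is deployed.
        from kubernetes import client, config

        config.load_kube_config()  # or load_incluster_config() when run from a pod

        container = client.V1Container(
            name="train",
            image="nvcr.io/nvidia/pytorch:21.10-py3",  # example NGC image
            command=["python", "train.py"],            # placeholder entrypoint
            resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
        )
        job = client.V1Job(
            api_version="batch/v1",
            kind="Job",
            metadata=client.V1ObjectMeta(name="gpu-train"),
            spec=client.V1JobSpec(
                backoff_limit=0,
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",
                        scheduler_name="volcano",  # drop to use the default scheduler
                        containers=[container],
                    ),
                ),
            ),
        )
        client.BatchV1Api().create_namespaced_job(namespace="default", body=job)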