
DeepOps – An efficient way to deploy GPU cluster for computing

HPC Summit | 2021 APAC
https://www.garaotus.com/HPC2021/vod/

Frank Lin

November 30, 2021

Transcript

  1. DeepOps – An Efficient Way To Deploy GPU Cluster for Computing
     Frank Lin @yylin1
     GARAOTUS HPC Summit 2021 | NOVEMBER 18, 2021
  2. About Me
     林義洋 Frank Lin
     • Co-organizer of Cloud Native Taiwan User Group
     • Interested in emerging technologies
     GitHub: yylin1 ([email protected])
     Blog: https://blog.yylin.io
  3. Agenda
     • Manage resources more effectively
     • What is DeepOps?
     • How to choose Job Schedule Management?
  4. Why Clusters?
     [Figure: CPU and GPU nodes managed ad hoc (Excel, internal page) vs. through a single platform]
     Ref: GTC 2020 - AI as a Service (AIaaS): Zero to Kubeflow: Supporting Your Data Science Teams with the Most Common Use Cases
     https://developer.nvidia.com/gtc/2020/video/s22040-vid
  5. AI Platform Considerations
     Ref: GTC Silicon Valley-2019: Building and managing scalable AI infrastructure with NVIDIA DGX POD and DGX POD Management software
     https://developer.nvidia.com/gtc/2019/video/s9334
  6. The power of one platform
     • Easy to deploy and run workflows from development through production
     • A single management platform that can support multiple teams or projects
     • Effectively manage and monitor cluster resources (GPUs), billing, and usage
     • Let each team focus on what it does best and avoid extra overhead:
       • Data Scientists -> Data Science
       • App Developers -> Implement Apps
       • DevOps Engineers -> DevOps / AI Workflow
  7. What is DeepOps?
     DeepOps is an open-source project for quickly deploying Kubernetes and Slurm, and an ideal choice for large-scale GPU server clusters as well as shared single-node deployments (e.g., NVIDIA DGX systems).
     • Highly flexible: can be adjusted or used in a modular way to match the cluster requirements of a specific use case
     • Uses Ansible to set up the entire cluster management stack end to end (see the sketch after this slide)
     • Can deploy Slurm, Kubernetes, or a hybrid of both
     • Provides scripts to quickly deploy Kubeflow on an existing cluster and to connect NFS storage
     • GitHub: https://github.com/NVIDIA/deepops
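
The typical DeepOps flow is: clone the repository, bootstrap Ansible on a provisioning node, copy and edit the example configuration (inventory), then run the playbook for the stack you want. Below is a minimal sketch of that flow driven from Python. The script and playbook paths (scripts/setup.sh, config.example, playbooks/k8s-cluster.yml, playbooks/slurm-cluster.yml) follow the layout of the DeepOps repository but may differ between releases, so treat them as assumptions and check the project README.

```python
"""Minimal sketch of the DeepOps provisioning flow, driven from Python.

Assumptions (not taken from the slides): the script and playbook paths
below follow the DeepOps repository layout and may differ between
releases; see the project README.
"""
import shutil
import subprocess
from pathlib import Path

DEEPOPS_REPO = "https://github.com/NVIDIA/deepops.git"
WORKDIR = Path.home() / "deepops"


def run(cmd, cwd=None):
    """Run a command and fail loudly if it returns non-zero."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, cwd=cwd, check=True)


def provision(target="k8s-cluster"):
    # 1. Fetch DeepOps on the provisioning (Ansible control) node.
    if not WORKDIR.exists():
        run(["git", "clone", DEEPOPS_REPO, str(WORKDIR)])

    # 2. Bootstrap Ansible and its dependencies on the control node.
    run(["bash", "scripts/setup.sh"], cwd=WORKDIR)

    # 3. Create a config dir from the example and edit the inventory
    #    (node hostnames/IPs, SSH user) before running any playbook.
    config = WORKDIR / "config"
    if not config.exists():
        shutil.copytree(WORKDIR / "config.example", config)

    # 4. Run the playbook for the chosen stack: "k8s-cluster",
    #    "slurm-cluster", or both in sequence for a hybrid deployment.
    run(
        ["ansible-playbook", "-l", target, f"playbooks/{target}.yml"],
        cwd=WORKDIR,
    )


if __name__ == "__main__":
    provision("k8s-cluster")
```

Calling provision("slurm-cluster") instead (or both targets in turn) gives the Slurm or hybrid deployment mentioned on the slide.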
  8. Building out your GPU cluster
     Ref: GTC Silicon Valley-2019: Building and managing scalable AI infrastructure with NVIDIA DGX POD and DGX POD Management software
     https://developer.nvidia.com/gtc/2019/video/s9334
  9. Installing & managing an AI as a Servicer cluster DeepOps

    Components Scheduler Ref: GTC Digital October 2020: GPU-Accelerated Labs on Demand Using Virtual Computer Serve r https://www.nvidia.com/en-us/on-demand/session/gtcfall20-a21371/
  10. Automation: Ansible
      • Open-source automation and configuration management tool
      • Agentless (nothing to install on target nodes)
      • Easier to maintain & scale than custom scripts
      • Playbooks use YAML: easy to learn and read (a small example follows this slide)
      Ref: https://www.ansible.com/
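
To make the "agentless, YAML playbooks" point concrete, here is a minimal sketch that writes a tiny inventory and playbook and runs them with the ansible-playbook CLI. The group name gpu-nodes, the host IPs, and the nfs-common package (Debian/Ubuntu naming) are placeholders invented for the example; PyYAML and Ansible must be installed on the control node, while the managed nodes only need SSH and Python.

```python
"""Tiny illustration of Ansible's model: a YAML playbook pushed over SSH,
with no agent on the managed nodes.

Assumptions (not from the slides): the "gpu-nodes" group, host IPs, and
the nfs-common package name are placeholders for this example.
"""
import subprocess
import yaml

# An inventory is just a list of reachable hosts, grouped by role.
INVENTORY = """\
[gpu-nodes]
10.0.0.11
10.0.0.12
"""

# A playbook is plain YAML: which hosts, which tasks, in order.
PLAYBOOK = [
    {
        "hosts": "gpu-nodes",
        "become": True,
        "tasks": [
            {
                "name": "Ensure the NFS client is installed",
                "ansible.builtin.package": {
                    "name": "nfs-common",
                    "state": "present",
                },
            }
        ],
    }
]

if __name__ == "__main__":
    with open("inventory.ini", "w") as f:
        f.write(INVENTORY)
    with open("site.yml", "w") as f:
        yaml.safe_dump(PLAYBOOK, f, sort_keys=False)

    # Ansible connects to each host over SSH; nothing is installed there.
    subprocess.run(
        ["ansible-playbook", "-i", "inventory.ini", "site.yml"],
        check=True,
    )
```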
  11. Schedulers Comparison
      Kubernetes:
      • Basic scheduling features (kube-batch)
      • Share nodes, schedule jobs for GPUs on a node (Excel spreadsheet)
      • Advanced container and workflow orchestration (e.g., with Kubeflow)
      • Covers data permissions and security (LDAP, file permissions)
      • Adds analytics and monitoring (important also for justification of purchase)
      Slurm:
      • Advanced scheduling features (batch / gang scheduling)
      • Works also without containers (no runtime overhead)
      • Multi-node jobs
      • Job dependencies, workflows, DAGs
      • Advanced reservations
      • Intelligent scheduling (not just FIFO)
      • User accounting
      • Other HPC-like scheduling functionality
      (A Kubernetes GPU pod sketch follows this slide; a Slurm submission sketch follows slide 12.)
      Ref: GTC Silicon Valley-2019: Building and managing scalable AI infrastructure with NVIDIA DGX POD and DGX POD Management software
      https://developer.nvidia.com/gtc/2019/video/s9334
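
For the Kubernetes side of the comparison, requesting a GPU is just a resource limit on the pod spec, and the scheduler packs such pods onto shared nodes. Below is a minimal sketch using the official kubernetes Python client; the pod name, namespace, and CUDA image tag are placeholders, and it assumes the NVIDIA device plugin or GPU Operator (which DeepOps can set up) is running so that nvidia.com/gpu is a schedulable resource.

```python
"""Minimal sketch: ask Kubernetes to schedule one GPU on any shared node.

Assumptions (not from the slides): the pod name, namespace, and CUDA
image tag are placeholders; the cluster must expose nvidia.com/gpu via
the NVIDIA device plugin / GPU Operator.
"""
from kubernetes import client, config

config.load_kube_config()  # use the local kubeconfig, as kubectl would

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda",
                image="nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04",
                command=["nvidia-smi"],
                resources=client.V1ResourceRequirements(
                    # The scheduler finds a node with a free GPU; other
                    # pods keep sharing the node's remaining CPUs and GPUs.
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
print("Submitted pod gpu-smoke-test; inspect it with: kubectl logs gpu-smoke-test")
```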
  12. AI Workflow - Choosing the Right Platform
      Quickly free data scientists and developers from platform setup and its steep learning curve, so teams spend more of their time actually using GPU resources.
      HPC (Slurm + Container):
      • Consistent with the developers' existing environment: a lower technical barrier and no change to how code is developed
      • Batch jobs are managed through scheduling queues that arbitrate resource allocation (a submission sketch follows this slide)
      • Node compute resources run on bare metal or a container runtime (e.g., Singularity, NVIDIA ENROOT)
      Application Containerization (Kubernetes):
      • High availability for long-running AI services
      • The platform provides more module integrations, tooling that supports AI workflow tasks, and monitoring of jobs and GPU resources
      • Job scheduling needs to be tuned for the specific application scenario
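
On the Slurm side, the unit of work is a batch script whose #SBATCH directives declare the resources to queue for. The sketch below writes such a script and submits it with sbatch; the partition name, GPU count, and train.py are placeholders, and the --container-image example in the comment requires the NVIDIA Pyxis/ENROOT plugin mentioned on this slide.

```python
"""Minimal sketch of the Slurm path: write a batch script, submit with sbatch.

Assumptions (not from the slides): the "gpu" partition, the 2-GPU request,
and train.py are placeholders; the container example needs Pyxis/ENROOT.
"""
import subprocess
from pathlib import Path

BATCH_SCRIPT = """\
#!/bin/bash
#SBATCH --job-name=train-demo
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --time=02:00:00

# Run the workload; with the Pyxis/ENROOT plugin installed, a container
# image can be used directly, e.g.:
#   srun --container-image=nvcr.io#nvidia/pytorch:23.10-py3 python train.py
srun python train.py
"""

if __name__ == "__main__":
    script = Path("train_demo.sbatch")
    script.write_text(BATCH_SCRIPT)
    # sbatch queues the job; Slurm's scheduler decides when and where it runs.
    out = subprocess.run(
        ["sbatch", str(script)], capture_output=True, text=True, check=True
    )
    print(out.stdout.strip())  # e.g. "Submitted batch job 12345"
```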
  13. Summary
      • DeepOps is an NVIDIA open-source project that removes much of the deployment drudgery and speeds up building GPU clusters
      • Slurm provides more scheduling algorithms, so HPC users and data scientists can run their jobs more effectively
        • Needs a module load system to provision GPU environment dependencies
      • Kubernetes suits batch-system / AI application scenarios such as TensorFlow, Spark, PyTorch, and MPI
        • Needs an additional job scheduler (kube-batch, Volcano) to improve execution policies (see the sketch after this slide)
      • Choose the platform based on the existing development environment; the best choice is the one that fits developers best
        • Watch the technical barriers (e.g., learning to write YAML, disk mounting issues)
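
As an illustration of the Volcano point, Volcano replaces the default one-pod-at-a-time scheduling with gang scheduling: a job only starts once minAvailable pods can be placed together. Below is a hedged sketch that submits a Volcano Job through the Kubernetes CustomObjectsApi; the job name, namespace, image, and replica counts are placeholders, and the manifest assumes Volcano's batch.volcano.sh/v1alpha1 Job CRD and "volcano" scheduler are installed in the cluster.

```python
"""Hedged sketch: a gang-scheduled Volcano Job submitted via the CRD API.

Assumptions (not from the slides): Volcano is installed (providing the
batch.volcano.sh/v1alpha1 Job CRD and the "volcano" scheduler); the job
name, namespace, image, and replica counts are placeholders.
"""
from kubernetes import client, config

config.load_kube_config()

job = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "gang-demo"},
    "spec": {
        "schedulerName": "volcano",
        # Gang scheduling: do not start until 2 worker pods fit at once,
        # avoiding a multi-pod job that starts only partially.
        "minAvailable": 2,
        "tasks": [
            {
                "name": "worker",
                "replicas": 2,
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "containers": [
                            {
                                "name": "worker",
                                "image": "nvcr.io/nvidia/pytorch:23.10-py3",
                                "command": ["python", "-c", "print('hello')"],
                                "resources": {"limits": {"nvidia.com/gpu": "1"}},
                            }
                        ],
                    }
                },
            }
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="batch.volcano.sh",
    version="v1alpha1",
    namespace="default",
    plural="jobs",
    body=job,
)
```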