Kubernetes-based GPU as a Service Platform by using Open Source Software [GTC 2020]

In application development, it has become quite common to increase business impact by raising agility.
On the other hand, hasn't the platform the applications run on been put off?
Clearly, the hardware cannot be fully utilized unless we provide the optimal execution environment for applications, and by extension a better environment for developers.
To do so, the platform itself must also evolve continuously.
So how should we keep evolving the platform?
At CyberAgent, we raise development agility and achieve continuous evolution by actively using OSS that matches current trends.
This session presents a case study of building and providing a platform that keeps evolving flexibly on top of NVIDIA's DGX A100, using Kubernetes and other OSS.

Daisuke Takahashi

October 09, 2020

Transcript

  1. Who are we?

     • Lee Yeongjae (AI Category Owner): Joined CyberAgent in 2016. Contributes to improving in-house products as a Solution Architect and to platform development (e.g., our OpenStack and container services). Also develops an AI platform as the AI category owner.
     • Masaya Aoyama (K8s aaS Product Owner): Implemented GKE-like Kubernetes-as-a-Service on our private cloud as product owner and supports the "Developer Experts" for Kubernetes projects at CyberAgent. Co-chair of the largest cloud native conference in Japan.
     • Shuichiro Makigaki (ML/Backend Engineer): Joined CyberAgent in 2016. Mainly works on in-house system development as a backend engineer and architect. Also works on platform development (OpenStack and container services) and develops an AI platform.
     • Daisuke Takahashi (Infrastructure Engineer): Mainly responsible for development of the private OpenStack platform and Kubernetes-as-a-Service, as well as effective utilization of various accelerator devices. Building the underlying physical infrastructure for the GPUaaS/AI platform.
  2. Agenda 1. Overview of CyberAgent, Inc. 2. Why we decided

    to use an on-premise environment 3. Kubernetes-based GPU-as-a-Service Platform 4. AI Platform 5. Physical layer around GPU 6. Conclusion
  3. “To create the 21st century’s leading company”: 3 Main Segments

     • Media: A variety of media services enjoyed by countless people ➔ AbemaTV ➔ AWA ➔ WinTicket
     • Advertisement: Offering comprehensive advertising solutions from agency business to ad technologies ➔ Dynalyst ➔ CA Wise ➔ AIR TRACK
     • Game: Developing 50+ smartphone games (including eight major titles on various platforms) ➔ GRANBLUE FANTASY ➔ PRINCESS CONNECT! Re:Dive ➔ Shadowverse
     ※「ABEMA」: © Abema TV, Inc. ※※「GRANBLUE FANTASY」、「PRINCESS CONNECT! Re:Dive」: © Cygames, Inc.
  4. Agenda 1. Overview of CyberAgent, Inc. 2. Why we decided

    to use an on-premise environment 3. Kubernetes-based GPU-as-a-Service Platform 4. AI Platform 5. Physical layer around GPU 6. Conclusion
  5. Why AI solutions for advertising?

     • To reduce the time needed to create effective ads and the customer-business domain knowledge required
     • To discover new, highly effective ad creatives
     • To predict the performance of ad creatives and prioritize them by ranking
     • To help analyze and improve the effectiveness of ad creatives
     • To identify and avoid ads that cause negative reactions
     (Example shown: a creative scored 97 points; "Similar ad detected!") ※「GRANBLUE FANTASY」: © Cygames, Inc.
  6. Why GPUs? We must process high volumes at high speed; GPU power can contribute to our business.

     • There is a huge number of combinations of advertising and media.
     • Computational complexity increases as more demographic information (e.g., region, age, and gender) is considered.
     • A fast learning cycle is required because advertisements change rapidly in response to changing consumer interests.
     • The advertising system handles bidding; thus, increased inference latency critically affects our business, and the latency requirement is severe.
  7. Why on-premises?

     Functionalities
     • To build a flexible software stack
     • To link with existing services
     Costs
     • Cloud fees remain high
     • Total on-premises costs will be lower in the long term
  8. Why on-premises? (cont.)

     (Chart: monthly cost in $ of GPU-only usage on the cloud, for part of the business segment)
  9. Agenda 1. Overview of CyberAgent, Inc. 2. Why we decided

    to use an on-premise environment 3. Kubernetes-based GPU-as-a-Service Platform 4. AI Platform 5. Physical layer around GPU 6. Conclusion
  10. GPUaaS architecture overview and minimal requirements

     • Provide GPU instances for users: multiple instances, multiple GPUs per instance
     • Isolate GPUs between processes
     • Provide shared volumes for each task
     (Diagram: computing resource pool and storage pool. Container icons: https://icons8.jp/icons/set/video-card)
  11. Container-based vs. VM-based vs. metal-based

     • Pros of container-based
     ◦ Easy packaging of the runtime environment as an image [cf. VM, metal]
     ◦ Low overhead and short launch time [cf. VM]
     ◦ Environment isolation for multi-tenancy [cf. metal]
     • Cons of container-based
     ◦ Weak runtime isolation [cf. VM]
     ◦ Short lifecycle [cf. VM, metal]
  12. Kubernetes

     Aggregates computing resources and orchestrates containers, volumes, etc. = aggregates GPUs and assigns them to processes together with volumes.
     • Storage systems ◦ Block ◦ Shared filesystem ◦ Others (PVC sketch below)
     (Diagram: computing resource pool and storage pool.)
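     As an illustration of how the storage pool is consumed per task, here is a minimal PersistentVolumeClaim sketch; the storage class name "shared-fs", the namespace, and the size are assumptions, not from the deck.

        # Minimal PVC sketch for a shared-filesystem volume
        # (hypothetical storageClassName "shared-fs").
        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: training-data
          namespace: user-a
        spec:
          accessModes:
            - ReadWriteMany        # shared filesystem: mountable by many pods
          storageClassName: shared-fs
          resources:
            requests:
              storage: 500Gi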
  13. Isolation for multi-tenancy

     Kubernetes namespaces can be used to isolate tenants (User A namespace / User B namespace).
     NOTE: the container runtime (Docker / runC) cannot be completely isolated.
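     A minimal sketch of this per-tenant setup, pairing a namespace with a GPU quota; the names and the quota value are illustrative assumptions.

        # Hypothetical tenant namespace plus a ResourceQuota that caps
        # how many GPUs the tenant may request.
        apiVersion: v1
        kind: Namespace
        metadata:
          name: user-a
        ---
        apiVersion: v1
        kind: ResourceQuota
        metadata:
          name: gpu-quota
          namespace: user-a
        spec:
          hard:
            requests.nvidia.com/gpu: "4"   # per-tenant GPU cap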
  14. User authentication/authorization on Kubernetes

     • Authentication
     ◦ Kubernetes service accounts
     ◦ OIDC integration
     ◦ Cloud-provider user/service account integration
     • Authorization
     ◦ Role-based access control (RBAC)
     ▪ CRUD on specific resources only (sketch below)
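     For example, restricting a tenant to CRUD on a few resource kinds could look like the following sketch; the user identity and the resource list are illustrative.

        # Sketch: a namespaced Role allowing CRUD on pods and PVCs only,
        # bound to a single (e.g., OIDC-authenticated) user.
        apiVersion: rbac.authorization.k8s.io/v1
        kind: Role
        metadata:
          name: gpu-user
          namespace: user-a
        rules:
          - apiGroups: [""]
            resources: ["pods", "persistentvolumeclaims"]
            verbs: ["create", "get", "list", "update", "delete"]
        ---
        apiVersion: rbac.authorization.k8s.io/v1
        kind: RoleBinding
        metadata:
          name: gpu-user-binding
          namespace: user-a
        subjects:
          - kind: User
            name: user-a@example.com       # illustrative OIDC identity
            apiGroup: rbac.authorization.k8s.io
        roleRef:
          kind: Role
          name: gpu-user
          apiGroup: rbac.authorization.k8s.io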
  15. Accessing GPU instances (containers)

     1. Access via Jupyter notebook from a web browser (sketch below)
     2. SSH-like access via the Kubernetes client tool:
        $ kubectl exec -it PODNAME-0 -- bash
        PODNAME-0 #
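     For option 1, one common pattern (a sketch, assuming the notebook server listens on port 8888 inside the pod) is to forward a local port:

        $ kubectl port-forward PODNAME-0 8888:8888
        # then open http://localhost:8888 in a web browser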
  16. Why Kubernetes? For "Cloud Native"

     Cloud Native means: • Resiliency • Easy management • Observability • Fast updates • Others (https://github.com/cncf/toc/blob/master/DEFINITION.md)
     Methods: A. Reconciliation by Kubernetes B. Ecosystem C. Extending and customizing
     ⇒ Continue to improve the platform with OSS for business success
  17. A: Reconciliation loop

     • Automatic recovery (convergence) to the desired state by many controllers
     ◦ Re-launch containers (processes) quickly
     ◦ Replace configs and credentials with the latest versions
     ◦ Reassign load balancer members
     (Diagram: the ReplicaSet controller watches the desired ReplicaSet below and reconciles the actual state toward it, replicas = 3.)
        kind: ReplicaSet
        spec:
          replicas: 3
          template:
            spec:
              containers:
              - image: nginx:1.16
  18. B: Automate with the Kubernetes ecosystem

     • Prometheus/Grafana ◦ Monitor GPU and server metrics
     • cert-manager ◦ Create and renew certificates with ACME
     • external-dns ◦ Associate IP addresses with hostnames (example below)
     • oauth2-proxy + NGINX ingress ◦ OAuth2 authentication for the web UI
     • Others ◦ Autoscaling, templating settings, etc.
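     As a concrete example of the external-dns item, a Service can declare its hostname through an annotation that external-dns watches; the hostname, selector, and ports here are illustrative.

        # Sketch: external-dns creates a DNS record for this Service.
        apiVersion: v1
        kind: Service
        metadata:
          name: notebook
          annotations:
            external-dns.alpha.kubernetes.io/hostname: notebook.example.com
        spec:
          type: LoadBalancer
          selector:
            app: notebook
          ports:
            - port: 80
              targetPort: 8888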
  19. C: Extending and customizing Kubernetes

     1. Implement custom controllers with the reconciliation model, e.g., S3 image caching for volumes
     2. Mutate container settings by webhook, e.g., automatically inject credentials
     3. Any status can be accessed via the Kubernetes API, e.g., collect usage status for billing
     4. Store metadata in Kubernetes using ConfigMaps or Secrets, e.g., a user's container image references for the web UI (sketch below)
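     Item 4 can be as simple as the following sketch; the keys and values are illustrative, not the team's actual schema.

        # Hypothetical ConfigMap holding per-user metadata for the web UI.
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: user-a-images
          namespace: user-a
        data:
          default-image: registry.example.com/user-a/train:latest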
  20. Why Kubernetes? For "Cloud Native" (recap)

     Cloud Native means: • Resiliency • Easy management • Observability • Fast updates • Others (https://github.com/cncf/toc/blob/master/DEFINITION.md)
     Methods: A. Reconciliation by Kubernetes B. Ecosystem C. Extending and customizing
     ⇒ Continue to improve the platform with OSS for business success
  21. NVIDIA and OSS

     • Kubernetes GPU device plugin (example below): https://github.com/NVIDIA/k8s-device-plugin
     • OSS monitoring stack, presented at KubeCon EU 2020: https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/program/schedule/
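     With the device plugin deployed, a container requests GPUs through the extended resource nvidia.com/gpu. A minimal sketch (the CUDA image is illustrative):

        # Pod requesting one GPU via the NVIDIA device plugin.
        apiVersion: v1
        kind: Pod
        metadata:
          name: cuda-test
        spec:
          restartPolicy: Never
          containers:
            - name: cuda
              image: nvidia/cuda:11.0-base
              command: ["nvidia-smi"]
              resources:
                limits:
                  nvidia.com/gpu: 1      # whole GPUs only (no fractions)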
  22. Agenda 1. Overview of CyberAgent, Inc. 2. Why we decided

    to use an on-premise environment 3. Kubernetes-based GPU-as-a-Service Platform 4. AI Platform 5. Physical layer around GPU 6. Conclusion
  23. Users' voices about our GPUaaS

     "Please identify your dissatisfaction with GPUaaS." (Multiple answers allowed)
     • "Development speed would be reduced; thus, I want to complete all processing on GCP or AWS."
     • "Because it's not as easy to perform machine learning as on the GCP AI Platform."
     • "I don't use it because it's difficult to migrate from the public cloud."
  24. Why an on-premises AI platform? The public cloud already provides many machine learning platforms, so why should we build one?

     • Use computational resources in the right place: we should select what we should use.
     • Create a cutting-edge environment for innovative products: it is important to be best friends with the environment.
  25. Example: AI Platform Training in Google Cloud

     A service to train models with various customization options. Supports different machine types, distributed training, hyperparameter tuning, and GPU/TPU acceleration. (https://cloud.google.com/ai-platform)
     Four simple steps:
     1. Package the training code
     2. Prepare the job definition as YAML (with hyperparameter tuning if required)
     3. Save the code & YAML to Google Cloud Storage
     4. Submit: gcloud ai-platform jobs submit training (sketch below)
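     Step 4 looks roughly like this sketch; the job name, paths, region, and versions are illustrative, and the flags follow Google's public documentation.

        $ gcloud ai-platform jobs submit training my_job_001 \
            --region us-central1 \
            --module-name trainer.task \
            --package-path ./trainer \
            --job-dir gs://my-bucket/jobs/my_job_001 \
            --runtime-version 2.1 --python-version 3.7 \
            --config hptuning_config.yaml   # hyperparameter tuning, if required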
  26. Idea: a GCP AI Platform-compatible on-premises AI Platform

     Ease of use is the justification: many users, a good I/O interface, continuous improvement, easy to introduce, etc. Remove barriers between on-premises & cloud.
     Same configuration and code
     • Introducing Kubeflow is reasonable
     • Treat a GCP AI Platform Job as a Kubeflow (Katib) resource
     • Abstract TFJob/PyTorchJob/K8s Job, etc.
     Same commands
     • Implement compatible commands as kubectl plugins
  27. What is Kubeflow? An army knife for machine learning on Kubernetes

     (https://www.kubeflow.org/docs/started/kubeflow-overview/)
     • On-premises deployment
     • Resource usage control by Kubernetes
     • Hyperparameter tuning with Katib
  28. What is Katib (in Kubeflow)?

     • Hyperparameter tuning: optimize hyperparameters
     • Neural architecture search: optimize the neural network structure
     • Multiple machine learning frameworks supported: TensorFlow, PyTorch, etc.
  29. Katib resources

     (Diagram: Experiment → Suggestion → Trials → TFJob/PytorchJob/Job → Pod with worker container and metrics container; a Metrics Collector writes to the Katib DB.)
     • Experiment: the execution unit of hyperparameter tuning; contains all settings (e.g., algorithms)
     • Suggestion: contains a hyperparameter pair generated according to the algorithm specified in the Experiment
     • Trial: coordinates each set of hyperparameters from Suggestions
     (A minimal Experiment sketch follows.)
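     Putting these resources together, a minimal Experiment could look like the sketch below, assuming the Katib v1beta1 API; the image, training script, and parameter ranges are illustrative (adapted from the upstream Katib random-search example), not this team's actual job.

        apiVersion: kubeflow.org/v1beta1
        kind: Experiment
        metadata:
          name: random-example
        spec:
          objective:
            type: maximize
            goal: 0.99
            objectiveMetricName: accuracy
          algorithm:
            algorithmName: random          # Suggestions pick hyperparameter pairs
          maxTrialCount: 12
          parallelTrialCount: 3
          parameters:
            - name: lr
              parameterType: double
              feasibleSpace:
                min: "0.01"
                max: "0.03"
          trialTemplate:
            primaryContainerName: training-container
            trialParameters:
              - name: learningRate
                description: learning rate
                reference: lr
            trialSpec:
              apiVersion: batch/v1
              kind: Job
              spec:
                template:
                  spec:
                    restartPolicy: Never
                    containers:
                      - name: training-container
                        image: docker.io/kubeflowkatib/mxnet-mnist:latest
                        command:
                          - "python3"
                          - "/opt/mxnet-mnist/mnist.py"
                          - "--lr=${trialParameters.learningRate}"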
  30. Same configuration/codes/commands

     On-prem. resource (kubectl plugin)          Cloud resource (gcloud)
     kubectl ai-platform jobs submit training    gcloud ai-platform jobs submit training
     kubectl ai-platform jobs list|get           gcloud ai-platform jobs list|get
     kubectl ai-platform jobs describe           gcloud ai-platform jobs describe
     kubectl ai-platform jobs stream-logs        gcloud ai-platform jobs stream-logs
     kubectl ai-platform jobs cancel             gcloud ai-platform jobs cancel
     (Both consume the same GCP job definition.)
  31. Abstract TFJob/K8s Job, etc. as a Katib Experiment (kubectl plugin implementation)

     • Treat a GCP AI Platform Job as a Katib Experiment
     • Parse the GCP-style job definition on the client side and convert it to a Katib Experiment
     • Transparent operation for end users: the user creates/deletes a Job = creates/deletes an Experiment (internally) = creates/deletes a TFJob/PyTorchJob (internally)
     (Diagram: same Katib resource chain as slide 29.)
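     The kubectl plugin mechanism this relies on is standard: kubectl dispatches an unknown subcommand to an executable named kubectl-<subcommand> found on PATH, with dashes in the subcommand encoded as underscores. A sketch showing only that name mapping; the plugin binary itself is this team's implementation.

        # "kubectl ai-platform ..." dispatches to "kubectl-ai_platform".
        $ chmod +x /usr/local/bin/kubectl-ai_platform
        $ kubectl plugin list                        # verify discovery
        $ kubectl ai-platform jobs submit training   # handled by the plugin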
  32. Run a job without hyperparameter tuning (kubectl plugin implementation, cont.)

     If there is no hyperparameter tuning section in the job definition, substitute a dummy parameter that limits the feasible space:
        Parameters:
          FeasibleSpace:
            List:
              - 0.02
          Name: dummy
          ParameterType: discrete
  33. Serving should be in the right place

     Private cloud pros: close to the data source, suitable for private tests; CPU on virtual machines and NVIDIA T4 are available.
     Public cloud pros: flexibility and availability via a global platform; CPU+GPU and TPU are available.
     Serving can often run with fewer resources than training.
  34. Agenda 1. Overview of CyberAgent, Inc. 2. Why we decided

    to use an on-premise environment 3. Kubernetes-based GPU-as-a-Service Platform 4. AI Platform 5. Physical layer around GPU 6. Conclusion
  35. Workstations in the MDF room (2019)

     • Clustered unused GeForce GTX 1080 Tis with Kubernetes for researchers
     ◦ Demand was much higher than expected, with many requests for a similar service from developers
  36. Issues with the workstation cluster

     1. Facility
     ◦ Poor power and cooling capabilities of the MDF room for high-power devices ▪ e.g., annual power outage
     ◦ High-latency connection to our datacenter network (site-to-site VPN) ▪ Not suited for inference-serving applications
     2. Workstation
     ◦ Lack of BMC/IPMI (remote management) on our machines ▪ We would like to maintain them remotely due to the COVID-19 pandemic
     3. GPU
     ◦ Limited memory capacity of GeForce cards ▪ Insufficient for some workloads
  37. Infrastructure considerations (2020)

     1. Location
     ◦ Our datacenter in Tokyo ▪ Sufficient power, cooling, and network capabilities
     2. Hardware
     ◦ Rack-mount servers (with IPMI) ▪ Convenient maintenance
     ◦ NVIDIA data center GPUs ▪ Sufficient GPU memory
     We began looking for GPU-accelerated servers at the end of April.
  38. NVIDIA A100 / DGX A100

     • Ampere architecture ◦ Notable performance improvements over "Volta" ▪ Up to 20x faster with sparsity
     • 3rd-gen NVLink / 2nd-gen NVSwitch ◦ Seamlessly scalable up to 16 GPUs ◦ 2x faster GPU-to-GPU bandwidth than their predecessors
     • Announce/release timing (May 14) ◦ Announced while we were compiling the list of candidate GPU servers ▪ Including DGX-1 and DGX-2
  39. MIG: Multi-Instance GPU

     MIG mode in the NVIDIA Ampere architecture can run seven jobs in parallel on an A100 GPU (NVIDIA Blog).
     • Multi-tenancy ◦ On DGX A100, the 8 GPUs can be sliced into up to 56 GPU instances ◦ Administrators can assign a right-sized GPU to each job
     • Guaranteed QoS ◦ Every GPU instance gets isolated memory and cores
     (An operational sketch follows.)
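     Operationally, slicing is driven by nvidia-smi. A sketch of carving one A100 into seven 1g.5gb instances, following NVIDIA's MIG user guide (profile ID 19 corresponds to 1g.5gb on the 40 GB A100; run on the DGX host):

        $ sudo nvidia-smi -i 0 -mig 1       # enable MIG mode (a GPU reset may be required)
        $ sudo nvidia-smi mig -i 0 -lgip    # list available GPU instance profiles
        $ sudo nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C   # create 7 instances with compute instances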
  40. DGX A100

     • 1 node (for now) ◦ Scale out if required
     • Almost ready ☑ Setup (OS, Kubernetes, etc.) ☑ Benchmark ☐ Evaluate MIG support in the Kubernetes device plugin
  41. Hardware around DGX A100

     • Compute: NVIDIA DGX A100
     • Network: Mellanox SN2010 (100GbE / 25GbE)
     • Storage: NetApp AFF A800
  42. Agenda 1. Overview of CyberAgent, Inc. 2. Why we decided

    to use an on-premise environment 3. Kubernetes-based GPU-as-a-Service Platform 4. AI Platform 5. Physical layer around GPU 6. Conclusion
  43. Conclusion: purpose

     Why do we need GPUs? We must process high volumes at high speed; GPU power can contribute to our business.
     Advantages of our on-premises resources:
     Functionalities • To build a flexible software stack • To link with existing services
     Costs • Cloud fees remain high • Total on-premises costs will be lower in the long term
  44. Conclusion: our solutions

     Improve the platform with an OSS stack:
     • GPUaaS (Kubernetes): operation automation with Kubernetes
     • AI Platform: compatible with the GCP AI Platform
     • DGX A100 + AFF A800: high-performance GPUs and storage
     Actively using OSS to keep improving the platform increases the agility of application development, which has a significant impact on the business.
  45. To-dos

     GPUaaS • Automatic slicing of GPU instances (MIG)
     On-premises AI Platform • Serving implementation • Pipeline implementation
     A100 GPU / DGX A100 • Add more DGX A100 systems as our business grows • Explore new possibilities of MIG and Kubernetes • Integrate the A100 with other GPUs (e.g., T4) for cost efficiency