
Kubernetes-based GPU as a Service Platform at CyberAgent [INSIGHT 2021 Digital]

We introduce an example of building and providing a platform that continues to evolve flexibly on NetApp® AFF A800 and NVIDIA DGX A100, adopting OSS such as Kubernetes together with Trident.

Daisuke Takahashi

October 20, 2021

Transcript

  1. Who Are We?

    Lee Yeongjae — AI Platform Project Manager: After joining CyberAgent in 2016, worked as a solutions architect for ad products and was in charge of private cloud and GKE-compatible container platform development. As AI infrastructure project manager, engaged in AI platform development.
    Masaya Aoyama — Software Engineer: Hired by CyberAgent out of college in 2016. Built a private cloud and GKE-compatible container platform from scratch with OpenStack. Co-chair of Japan's biggest Cloud Native conference, official CNCF community organizer, etc.
    Daisuke Takahashi — Infrastructure Engineer: Hired by CyberAgent out of college in 2019. After working on private cloud operation and container infrastructure development, now leads the ML/3DCG/adtech domains as a Solutions Architect. Handled physical layer design and operation for this project.
  2. Agenda 1. CyberAgent 2. The Necessity of GPUs 3. Hardware

    4. Kubernetes-based GPU-as-a-Service Platform 5. AI Platform 6. Wrap & Future Outlook
  3. "To create the 21st Century's Leading Company" Media Offers many

    services tailored to the changing internet industry ➔ AbemaTV ➔ AWA ➔ WinTicket Advertising Applies operational and creative capabilities for maximum ad impact to offer comprehensive solutions, including AI-powered adtech Video Games Offers about 50 smartphone games, including 8 major titles ➔ Umamusume: Pretty Derby ➔ GRANBLUE FANTASY ➔ PRINCESS CONNECT! Re:Dive 3 Main Segments *"Abema": © Abema TV, Inc. **"Umamusume: Pretty Derby," "GRANBLUE FANTASY": © Cygames, Inc.
  4. Agenda 1. CyberAgent 2. The Necessity of GPUs 3. Hardware

    4. Kubernetes-based GPU-as-a-Service Platform 5. AI Platform 6. Wrap & Future Outlook
  5. Why Do We Need AI Solutions? Reducing costs • Gain

    business domain knowledge • Automated ad generation Improving impact • Identify unexpectedly effective ads • Improve ad impact analysis Reducing risk • Avoid risk of controversies from ad placement
  6. Why Do We Have to Use GPUs? Processing of massive,

    complex info • The combination of ads and posted media is huge • Enormous info referencing population statistics (area, age, etc.) Fast execution • Trends change fast • Get real-time info
  7. AI Solutions Execution Environment

    Users: Researchers, Data Scientists, MLOps Engineers Jupyter Notebook: Execution environment for interactive GUI programs Google AI Platform: Manage ML workflows with client tools/GUI — implement code and prepare input data, train & evaluate the model, deploy the model, make inferences with the model, monitor inferences, manage model versions
  8. System Architecture

    DGX A100 + AFF A800 → Offer high-performance GPUs & storage GPU-as-a-Service (Kubernetes) → Offers Jupyter Notebook Original AI Platform → Offers infrastructure equivalent to Google AI Platform
  9. Why On-Premises?

    Features • Flexible software stack assembly • Easy connections with existing services Costs • Cloud expenses are high • On-premises is inexpensive over the long term
  10. Agenda 1. CyberAgent 2. The Necessity of GPUs 3. Hardware

    4. Kubernetes-based GPU-as-a-Service Platform 5. AI Platform 6. Wrap & Future Outlook
  11. History of GPUaaS

    v1: GPU containers • Implemented central management of GPU resources for researchers ◦ Assigned an entire host to each researcher exclusively v2: GPU containers + Jupyter Notebook • Managed Notebook environment for researchers • Or primitive GPU containers, as in v1 v3: GPU containers + Jupyter Notebook + AI Platform • Expanded availability to developers in addition to researchers • Hosting AI Platform (GCP-compatible) on top of GPUaaS
  12. GPUaaS v1 Efficient use of GPU resources • Centralized management

    of researchers’ workstations ◦ Assigns 1 host (node) per user • Located at server room in our office ◦ GPU: 20x NVIDIA GeForce GTX 1080Ti 11 GB (220 GB) ◦ CPU: 324 cores ◦ Memory: 1.28 TB Environment for easier reproduction • Simplified recreation of experiment environment with container virtualization • Adopted Kubernetes, a proven option at CyberAgent ◦ Offers direct access to Kubernetes API
  13. GPUaaS v2 Successor to GPUaaS v1 • Migrated GPU resources

    from v1 • Changed assignment policy to shared use (multi-tenancy) NEW: Shared storage for datasets • Could mount same data from containers • Software-defined storage on Kubernetes ◦ NFS service by Rook (Ceph) ◦ Usable Capacity: 48 TB with SATA SSDs NEW: Managed training environment • Launch Jupyter Notebook w/o Kubernetes knowledge • Could bring custom container images, optionally
  14. Operational Issues for v2

    Location / Site • Office building is NOT a datacenter ◦ Reached the limit of power and cooling ◦ Regular power shutdowns for legally required inspections • Poor connection quality ◦ Site-to-site VPN only ◦ Non-redundant network Machine maintenance • Lack of remote management features ◦ No BMC equipped (field ops required) ◦ Restricted access to the office due to COVID-19 Performance • Insufficient GPU memory ◦ GeForce series not designed for ML • Outdated hardware ◦ Newer CPUs and GPUs have come out ◦ Increasing rate of hardware failures
  15. Considerations for Improving GPUaaS

    Developers also want to use the researcher-favored platform for their services. To achieve the required quality, we had to address the issues in v2. Location / Site • Escape from the office building • Use the existing datacenter in Tokyo that hosts our private cloud Specs • Brand-new servers for GPUaaS (IPMI required) • Enterprise-grade GPUs with massive memory ◦ Tesla V100, T4, etc.
  16. NVIDIA A100

    Ampere Architecture • Up to 20x faster than the predecessor (V100) New hardware features • Multi-Instance GPU, Sparsity, etc. Faster GPU-to-GPU interconnection • 3rd Gen NVLink, 2nd Gen NVSwitch • Up to 16 GPUs • Full mesh topology at 600 GB/s each
  17. MIG: Multi-Instance GPU MIG mode in the NVIDIA Ampere architecture

    can run seven jobs in parallel on an A100 GPU (NVIDIA Blog) Multi-tenancy • For DGX A100, its 8 GPUs can be sliced into 56 GPU instances • Administrators can assign right-sized GPUs for each job Guaranteed QoS • All GPU instances include isolated memory (capacity/bandwidth) and cores
  18. GPUaaS v3 Renewed GPU hardware • Adopted NVIDIA DGX A100

    ◦ GPU: 8x NVIDIA A100 40 GB (320 GB) ◦ CPU: 128 cores ◦ Memory: 1 TB ◦ Testing combination of new HW features and Kubernetes (Thanks to the people at NVIDIA for helping out!) ▪ Details on software later • Installed in DGX-Ready datacenter ◦ Verified location for power, cooling, installation
  19. Storage Improvements

    Constraints for storage in v2 (hardware specs) • Low capacity efficiency per rack space → Should introduce large-capacity drives and/or chassis with many disk slots • Insufficient throughput for transferring datasets (compared with the A100 GPU's performance) → Should improve disk and network performance Focus on using storage, not operating it • Rook (Ceph) was a suitable option to reuse existing resources ◦ Not motivated to operate SDS, since the purpose is providing storage space → Should consider appliances, not just SDS Additional features • Want block access for GPUaaS's internal metadata DBs
  20. GPUaaS v3 (contd.)

    Revamped storage for datasets • Adopted NetApp AFF A800 ◦ NVMe SSD 62 TB (All-Flash) ▪ Capable of scale-out/scale-up by adding: • Disks (into empty bays) • Disk shelves • Controllers ◦ Multi-protocol access ▪ File (NFS, SMB), Block (iSCSI, etc.), Object (S3) ◦ Details on Kubernetes integration later • Selected with NVIDIA DGX POD in mind ◦ Scalable reference architecture for DGX systems and storage ◦ Announced by NetApp as ONTAP AI * Photo of the evaluation system. Some configurations differ.
  21. Reference: v3 Hardware Overview

    Compute: NVIDIA DGX A100 Network: Mellanox SN2010 (100 GbE / 25 GbE) Storage: NetApp AFF A800
  22. Agenda 1. CyberAgent 2. The Necessity of GPUs 3. Hardware

    4. Kubernetes-based GPU-as-a-Service Platform 5. AI Platform 6. Wrap & Future Outlook
  23. GPU-as-a-Service Overview

    In-house infrastructure providing a GPU interface to users and upper services. (Minimum) requirements • Provision of desired resources cut out from pooled computing resources • Isolated GPUs to prevent interference during task execution • High-performance storage allowing simultaneous connections Adopted infrastructure: containers + Kubernetes (a computing resource pool and a storage pool) Container icons: https://icons8.jp/icons/set/video-card
  24. Containers vs. VM vs. Bare Metal

    • Advantages of containers ◦ Can easily create an image of the execution environment (cf. VM, Bare Metal) ◦ Low overhead, launches quickly (cf. VM) ◦ Enables multi-tenant environments for multiple users (cf. Bare Metal) • Disadvantages of containers ◦ Lower isolation compared to VMs ◦ Short environment life cycle (cf. VM, Bare Metal)
  29. Kubernetes

    In a production environment, we must manage containers across multiple host machines, which calls for a container orchestration tool; Kubernetes is one such tool. It manages a computing resource pool and a storage pool: • Scheduling • Rolling updates • Health checks • Auto scaling • Self-healing on malfunction • Authentication & authorization • Service discovery • Load balancing • Attaching confidential info • Multi-tenancy • Integration with storage ◦ Block ◦ Shared filesystem ◦ Others
  30. GPU and Kubernetes

    Kubernetes supports pluggable device plugins, so it can handle various devices: • GPU (NVIDIA/AMD/Intel) • TPU (Tensor Processing Unit) • FPGA • etc. We use Prometheus + DCGM Exporter for monitoring; users have a dashboard to check GPU usage, etc. Note: NUMA and GPU topology can be optimized when using InfiniBand with Kubernetes (the container runtime must be compatible).

    containers:
    - name: mljob
      image: my-mljob:v0.1
      resources:
        limits:
          nvidia.com/gpu: 1          # request a whole GPU, or:
          nvidia.com/mig-3g.20gb: 1  # request a MIG instance
  31. Kubernetes' Advantages

    1. Reconciliation loop to bring into a declared state 2. Huge ecosystem 3. Highly extendable platform
  32. 1. Reconciliation Loop to Bring into a Declared State

    Kubernetes uses programs called controllers to control the system. Multiple controllers drive the actual state toward the declared state: ◦ Maintain the specified number of replicas ◦ Recover containers shut down by broken nodes ◦ Auto reload when confidential info or config files change ◦ Auto management of load balancer members ◦ etc. For example, the ReplicaSet controller watches both the actual ReplicaSet and the desired state below (replicas=3):

    kind: ReplicaSet
    spec:
      replicas: 3
      template:
        spec:
          containers:
          - image: nginx:1.16
  33. 2. Huge Ecosystem

    The CNCF and Kubernetes community promote open technology and develop and release various OSS integrated with Kubernetes. By using OSS extension controllers that employ reconciliation loops, you can let Kubernetes handle most routine operations. • Prometheus / Grafana: monitor GPUs and wide-ranging middleware • cert-manager: manages auto generation of certificates using ACME; auto-integrates with load balancers • external-dns: manages provided IP addresses and DNS records • oauth2-proxy + nginx ingress: integrates OAuth2 with requests to the WebUI • Others: auto scaling, progressive delivery, data transmission between projects, etc.
  34. 3. Highly Extendable Platform

    Kubernetes is made to be easily extendable and enables expansion of features tailored to our company's specific domains. It also offers a framework for implementing controllers (even reusing general OSS). Examples (a webhook sketch follows below): • Auto data load + cache from S3/GCS (Custom Controller) • Auto injection of cloud authentication info (Mutating Webhook) • Metadata storage using the application instead of a database (Secret/ConfigMap) • Billing system based on info retrieved from Kubernetes Our implementation also keeps pace with standardization: container runtime (OCI/CRI), networks (CNI), storage (CSI), etc.
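    As an illustration of the Mutating Webhook example above, a minimal sketch of registering such a webhook; the injector's service name, namespace, and path are hypothetical, not from the deck:

    apiVersion: admissionregistration.k8s.io/v1
    kind: MutatingWebhookConfiguration
    metadata:
      name: cloud-credential-injector        # hypothetical name
    webhooks:
    - name: inject.credentials.example.com   # hypothetical webhook identifier
      clientConfig:
        service:
          name: credential-injector          # hypothetical in-cluster service
          namespace: gpuaas-system           # hypothetical namespace
          path: /mutate
        # caBundle: <CA bundle for the webhook server>
      rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
      admissionReviewVersions: ["v1"]
      sideEffects: None
      failurePolicy: Ignore                  # don't block pod creation if the webhook is down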
  35. Kubernetes' Advantages

    1. Reconciliation loop to bring into a declared state 2. Huge ecosystem 3. Highly extendable platform ⇒ Restoration capability, easy management, observability, frequent updates with robust automation, etc. As we track the evolution of OSS, we are guiding business to success by continuing to improve upon our platforms.
  36. Linking Storage with Kubernetes

    • We use the Container Storage Interface (CSI) to integrate storage with Kubernetes. • CSI is an interface that connects container orchestrators with storage; it can handle multiple orchestrators and multiple storage products. CSI only defines open specifications, so the features actually available differ according to the CSI driver. ◦ https://github.com/container-storage-interface/spec ◦ https://kubernetes-csi.github.io/docs/drivers.html Capabilities covered by CSI (a volume-request sketch follows below): • Volume creation/deletion • Volume attachment/detachment • Volume expansion • Volume cloning • Snapshot & restore • Topology designation • Raw block volume creation
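    For illustration, a minimal sketch of how a dataset volume request looks once a CSI driver is in place; the storage class name ontap-nas is an assumption (see the StorageClass sketch under slide 38), not taken from the deck:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: dataset-volume            # hypothetical claim name
    spec:
      accessModes:
      - ReadWriteMany                 # shared dataset access (e.g. over NFS)
      storageClassName: ontap-nas     # assumed Trident-backed storage class
      resources:
        requests:
          storage: 500Gi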
  37. CSI Feature Set

    The CSI driver in Kubernetes is divided into multiple sub-features. Even adequate storage features are unusable without a compatible CSI driver, and missing infrastructure features can prevent upper services from delivering value. CSI driver considerations when selecting storage: 1. Speed of tracking Kubernetes upstream features (release frequency, upstream participation) 2. CSI driver quality (including bug-fix speed) NetApp has developed Trident since before CSI existed, and it has very good release frequency, features, and quality. (Diagram: Trident sits between the container orchestrator's CSI and the storage.)
  38. Trident as OSS

    Trident is released as OSS, so the CSI driver implementation is not a black box. • Our team wants to avoid always having to wait when a problem occurs Excellent development organization • Three-month release cycle (since Dec 2016) • Proactive contributions to the community mean we can expect fast upstream response Compatibility with both ReadWriteOnce and ReadWriteMany (AFF at CyberAgent) • NAS: for ML workloads • SAN: for the system applications that make up GPUaaS (databases, Prometheus, etc.) A storage-class sketch follows below.
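    A minimal sketch of how Trident might be wired in, assuming the Trident CSI provisioner with an ONTAP NAS backend; the class name and parameters are illustrative:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: ontap-nas                 # hypothetical class for ML datasets (RWX over NFS)
    provisioner: csi.trident.netapp.io
    parameters:
      backendType: "ontap-nas"        # Trident backend driver type (assumed)
    allowVolumeExpansion: true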
  39. Two Ways to Use GPU-as-a-Service

    GPUaaS is operated through (A) a dedicated web console or (B) kubectl ($ kubectl ...) against the Kubernetes API server, via the GPUaaS API server. The web console can: • Launch notebooks • Manage volumes • Show billing info • Manage projects • etc.
  40. Multi-tenancy in Kubernetes

    The Kubernetes Namespace concept maps naturally to tenants. We use ClusterRole and RoleBinding to manage permissions: "Add member to project from the WebUI console" = "Add RoleBinding". This allows seamless management, much like a user database, and other operations are also possible on the WebUI console, depending on role. (Diagram: per-user (UserA, UserB) and per-team (TeamX, TeamY) namespaces bound to admin/member ClusterRoles via RoleBindings.) A RoleBinding sketch follows below.
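    A minimal sketch of the RoleBinding created when a member joins a project; the user, namespace, and binding names are hypothetical, while the member ClusterRole comes from the slide:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: member-usera                  # hypothetical binding name
      namespace: teamx                    # the project (tenant) namespace
    subjects:
    - kind: User
      name: usera                         # hypothetical user
      apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: ClusterRole
      name: member                        # shared "member" ClusterRole
      apiGroup: rbac.authorization.k8s.io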
  41. Multi-tenant Management with Hierarchical Namespaces

    The Hierarchical Namespace Controller propagates shared settings to all child namespaces: policies, config data, shared cloud account authentication info, metadata (CustomResources), etc. (Diagram: a parent GPUaaS namespace whose policy propagates down to the UserA, UserB, and TeamX namespaces.) A child-namespace sketch follows below.
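    A minimal sketch of creating a child namespace under the GPUaaS parent, assuming HNC's v1alpha2 API; the namespace names are illustrative:

    apiVersion: hnc.x-k8s.io/v1alpha2
    kind: SubnamespaceAnchor
    metadata:
      name: team-x                 # hypothetical child namespace to create
      namespace: gpuaas            # parent namespace whose policies propagate down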
  42. Providing GPU Instances

    1. Offer Jupyter Notebooks with our original web console For users unfamiliar with Kubernetes: data scientists, researchers, etc. 2. SSH-like environment using the Kubernetes CLI:

    $ kubectl exec -it PODNAME-0 -- bash
    PODNAME-0 #
  43. Providing GPU Instances

    3. Provide infrastructure to upper services For developing machine-learning infrastructure based on Kubernetes. We can't implement what we want if lower-layer features are inadequate ⇒ storage features and CSI driver functionality are important. (Diagram: AI Platform on GPUaaS (Kubernetes) on DGX A100 + AFF A800.) Considering multi-DC rollout using Kubernetes portability.
  44. Agenda 1. CyberAgent 2. The Necessity of GPUs 3. Hardware

    4. Kubernetes-based GPU-as-a-Service Platform 5. AI Platform 6. Wrap & Future Outlook
  45. Why We Need an Original AI Platform

    "What do you not like about the currently available GPUaaS?" (multiple responses): • Development is slower; want it to be on par with GCP/AWS • Can't train as easily as on an AI platform • Don't use it, since migration from the cloud is hard
  46. AI Platform Overview

    Our Google AI Platform-compatible infrastructure for managing ML workflows • Overview of the Training System • Overview of the Inference System
  47. AI Platform Requirements

    Machine learning & inference • Can offload Google AI Platform ML workflows ◦ Object storage as the hub • Same operability as Google AI Platform ◦ kubectl plugin features • Can reuse Google AI Platform configuration files Training • Capable of hyperparameter tuning ◦ Use Katib, a Kubeflow component Inference • Can create inference endpoints ◦ Use KFServing, a Kubeflow component • Model version management ◦ Use our original model metadata management infrastructure • External access to inference endpoints ◦ Authorization with Istio and External Authorization
  48. Consistent Operability

    The kubectl ai-platform plugin (on-prem resources) mirrors gcloud ai-platform (cloud resources, Google AI Platform) command for command:
    kubectl ai-platform jobs submit training...   ↔ gcloud ai-platform jobs submit training...
    kubectl ai-platform jobs submit prediction... ↔ gcloud ai-platform jobs submit prediction...
    kubectl ai-platform predict...                ↔ gcloud ai-platform predict...
    kubectl ai-platform models...                 ↔ gcloud ai-platform models...
    kubectl ai-platform versions...               ↔ gcloud ai-platform versions...
  49. Kubeflow — Toolkit for Building ML Workflows on Kubernetes

    https://www.kubeflow.org/docs/started/kubeflow-overview/ • Free choice of deployment environment ◦ Including on-premises deployment • Resource management with Kubernetes ◦ Expandable; big ecosystem ◦ Object control with manifests • Hyperparameter tuning ◦ Katib • Inference endpoint creation ◦ KFServing
  50. Katib Components

    (Diagram: Experiment → Suggestion → Trials → TFJob/PyTorchJob/Job → Pod with Worker Container, Metrics Container, and a Metrics Collector writing to the Katib DB.) Experiment • An individual hyperparameter-tuning run • Holds all the settings (algorithms, etc.) Suggestion • Hyperparameter generation Trial • Training executed with the suggested hyperparameters Metrics Container • Saves the model's training/prediction accuracy as metrics Metrics Collector • Writes metrics to the database and completes the tuning An Experiment sketch follows below.
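    A minimal sketch of a Katib Experiment, assuming Katib's v1beta1 API; the image, metric name, and parameter ranges are illustrative, not from the deck:

    apiVersion: kubeflow.org/v1beta1
    kind: Experiment
    metadata:
      name: lr-tuning                  # hypothetical experiment name
    spec:
      objective:
        type: maximize
        objectiveMetricName: accuracy  # metric reported via the Metrics Collector
      algorithm:
        algorithmName: random          # Suggestion algorithm
      maxTrialCount: 12
      parallelTrialCount: 3
      parameters:
      - name: lr
        parameterType: double
        feasibleSpace:
          min: "0.01"
          max: "0.1"
      trialTemplate:
        primaryContainerName: training-container
        trialParameters:
        - name: learningRate
          reference: lr
        trialSpec:                     # each Trial runs as a Job
          apiVersion: batch/v1
          kind: Job
          spec:
            template:
              spec:
                restartPolicy: Never
                containers:
                - name: training-container
                  image: my-training:latest   # hypothetical training image
                  command: ["python3", "train.py", "--lr=${trialParameters.learningRate}"]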
  51. Job Submit Flow for AI Platform Training

    1. Execute submit command 2. Compress model code 3. Send compressed code to GCS/S3 4. Create Katib Experiment 5. Trial creates pods (+PV attachment) 6. Download compressed code 7. Execute training 8. Send model to GCS/S3 icons: https://icons8.jp/icons/set/video-card
  52. KFServing Overview

    • Provides inference features (defined with the InferenceService resource) • Abstracts model serving (compatible with TensorFlow, PyTorch, XGBoost, etc.) • Manages serving containers (Knative) • Manages traffic routing (Istio) Special features • Auto scaling • Canary rollout, A/B tests • Prediction, preprocessing, post-processing, etc. https://github.com/kubeflow/kfserving/blob/master/docs/diagrams/kfserving.png
  53. InferenceService

    • KFServing custom resource definition • The inference endpoint entity ◦ Containers that load models ◦ Provides the inference endpoint (FQDN) • Supports both preprocessing and post-processing • PodSpec descriptions allow custom containers An InferenceService sketch follows below.
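    A minimal sketch of an InferenceService, assuming KFServing's v1beta1 API; the name and model location are illustrative:

    apiVersion: serving.kubeflow.org/v1beta1
    kind: InferenceService
    metadata:
      name: my-model                # hypothetical endpoint name (becomes part of the FQDN)
    spec:
      predictor:
        tensorflow:                 # built-in TensorFlow serving runtime
          storageUri: "s3://my-bucket/models/my-model"  # hypothetical model location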
  54. Original Model Metadata Management Infrastructure

    Overview • Manages metadata assignments to models • Originally an infrastructure for follow-up tests and reproduction • Developed and operated by another department Special features • Savable model version histories • Can tie metadata to models ◦ Code, datasets, etc. • Controllable model access rights ◦ 3 patterns: Read, Write, RW • Can designate model location ◦ Compatible with GCS/S3
  55. Istio and External Authorization

    External authorization • An Envoy feature • Delegates authorization of requests to an external server • Lets us implement original authorization logic • The authorization server implements a predefined REST API/gRPC service • Using Istio simplifies the settings

    Register the authorization policy:

    apiVersion: security.istio.io/v1beta1
    kind: AuthorizationPolicy
    metadata:
      name: ext-authz
    spec:
      selector:
        matchLabels:
          app: istio-ingressgateway
      action: CUSTOM
      provider:
        name: "ext-authz-service"
      rules:
      - to:
        - operation:
            paths: ["/auth/*"]

    Register the authorization server (mesh config):

    extensionProviders:
    - name: "ext-authz-service"
      envoyExtAuthzGrpc:
        service: "ext-authz.default.svc.cluster.local"
        port: 9000
  56. AI Platform Prediction Flow

    Register model: 1. Execute "models create" command 2. Generate model-creation request on the API server 3. Register model on the metadata server Create model version: 4. Execute "versions create" command 5. Generate InferenceService-creation request on the API server 6. Apply the InferenceService manifest 7. Create the InferenceService (+ download the model) Execute prediction: 8. Execute "predict" command 9. Authorization on the external authorization server 10. Transfer request to the InferenceService 11. Execute inference 12. Return the inference response icons: https://icons8.jp/icons/set/video-card
  57. Agenda 1. CyberAgent 2. The Necessity of GPUs 3. Hardware

    4. Kubernetes-based GPU-as-a-Service Platform 5. AI Platform 6. Wrap & Future Outlook
  58. Wrap

    Why are GPUs necessary? ★ To quickly process massive, complex data On-premises advantages: Features • Flexible software stack assembly • Easy connections with existing services Costs • Cloud expenses are high • On-premises is inexpensive over the long term
  59. Wrap (contd.)

    Features always improving with OSS; maximize Kubernetes' advantages. (Diagram: AI Platform with Google AI Platform compatibility, on GPUaaS (Kubernetes), on ultra-high-performance GPUs/storage: DGX A100 + AFF A800.) By making aggressive use of OSS and improving upon platforms, we can make application development more agile and have a big impact on business.
  60. Future Outlook

    GPUaaS • Automated MIG partitioning features AI Platform • Pipeline features ◦ ML workflow automation • Black-box optimization features Physical hardware • Additional DGX A100/A100 GPUs and AFF A800 • Improve cost-effectiveness by mixing in other GPUs (T4, etc.)