

Asei Sugiyama
December 19, 2022

When should we use Kubernetes for the Machine Learning platform?

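This deck examines, based on case studies from Japan, the organizational capabilities needed to adopt Kubernetes as a machine learning platform. It was prepared for an internal study session on MLOps at Money Forward.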

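The deck touches on the efforts of various organizations, but these are the author's personal views. They are based on materials published to date and are not the official positions of those organizations.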



Transcript

  1. TOC
      - Why Kubernetes for ML? <-
      - Decisions to use Kubernetes and Kubeflow or not
      - Requirements to use Kubernetes as an ML platform
      - Summary
  2. Why Kubernetes for ML?
      - Use cases of ML platform to consider
      - Training/Serving Skew
      - Container
      - Kubernetes
      - Kubeflow
      - Let's use Kubernetes / Kubeflow for ML platform!
      - Managed ML platform
  3. Use cases of ML platform to consider
      - Data analytics & model development
      - Model training
      - Inference
  4. Data analytics & model development
      - Workload
          - Ad-hoc analysis
          - Service-independent model development
      - Case study: building a data analytics platform for a major entertainment company with Google Cloud https://youtu.be/BTYO0-avsXI
      - Beyond Interactive: Notebook Innovation at Netflix https://netflixtechblog.com/notebook-innovation-591ee3221233
  5. Requirements
      - Easy and safe access to the large dataset
      - Visualization without code
      - Not required
          - Version control
          - High availability
      - Case study: building a data analytics platform for a major entertainment company with Google Cloud https://youtu.be/BTYO0-avsXI
      - Beyond Interactive: Notebook Innovation at Netflix https://netflixtechblog.com/notebook-innovation-591ee3221233
  6. Model training
      - Workload
          - Batch processing (training pipeline)
      - MLOps: Continuous delivery and automation pipelines in machine learning https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
  7. Model training
      - Requirements
          - Massive amount of compute resources (CPUs, memory, accelerators)
          - Massive amount of storage access (IOPS, network bandwidth)
          - Visualization
          - Version control (code, data, model, and lineage between them)
      - Not required
          - High availability
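
The version control requirement above (code, data, model, and the lineage between them) can be sketched as follows. This is a minimal illustration, not any platform's actual metadata store: `LineageRecord`, `record_lineage`, and the sample commit hash and byte strings are all hypothetical names and values.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_of(content: bytes) -> str:
    """Content digest used to identify a dataset or a model artifact."""
    return hashlib.sha256(content).hexdigest()


@dataclass(frozen=True)
class LineageRecord:
    """Ties one trained model to the code and data that produced it."""
    code_version: str  # e.g. a Git commit hash (hypothetical value below)
    data_digest: str
    model_digest: str


def record_lineage(code_version: str, data: bytes, model: bytes) -> LineageRecord:
    return LineageRecord(code_version, sha256_of(data), sha256_of(model))


# Hypothetical inputs standing in for a real dataset and a serialized model.
record = record_lineage("abc1234", b"feature,label\n1.0,0\n", b"serialized-model-bytes")
print(json.dumps(asdict(record), indent=2))
```

Because the record is keyed by content digests, retraining on changed data yields a different record even when the code version stays the same.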
  8. Inference
      - Workload
          - Web API
          - Batch processing
      - Accelerated Computing on AWS for NLP https://speakerdeck.com/icoxfog417/accelerated-computing-on-aws-for-nlp
  9. Inference
      - Requirements (Web API)
          - Low latency
          - High availability
          - Scalability
          - Version control (code, data, model, and lineage between them)
      - Not required (Web API)
          - Massive amount of storage access for each request (hopefully)
  10. Training/Serving Skew
      - The same code has to run in all three use cases
      - Moreover, we have to consider dev/staging/prod
      - Otherwise we get training/serving skew, which is caused by differences between environments
      - Why We Need DevOps for ML Data https://www.tecton.ai/blog/devops-ml-data/
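
As a toy illustration of training/serving skew, the sketch below compares a feature transform used at training time with a hypothetical serving-side re-implementation that forgets to divide by the standard deviation. All function names and values are invented for this example. The slide attributes skew to differences between environments in general, and a diverged code path is only one such difference.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Compute normalization statistics on the training data."""
    return mean(values), pstdev(values)


def training_transform(x, scaler):
    mu, sigma = scaler
    return (x - mu) / sigma


def serving_transform_reimplemented(x, scaler):
    # Hypothetical bug: the serving side re-implemented the transform
    # and dropped the division by the standard deviation.
    mu, _ = scaler
    return x - mu


train_values = [10.0, 20.0, 30.0]
scaler = fit_scaler(train_values)

x = 30.0
skew = abs(training_transform(x, scaler) - serving_transform_reimplemented(x, scaler))
print(f"feature value differs by {skew:.3f} between training and serving")
```

Shipping `training_transform` itself to every environment, instead of re-implementing it, removes this particular source of skew.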
  11. Container
      - What we should manage across these environments:
          - Code
          - Libraries
          - Driver (CUDA, etc.)
          - OS
      - Containers (and, in the past, machine images) are the de facto standard format for this purpose
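
To make the list above concrete, here is a small sketch that captures part of an environment (Python version, OS, library versions) and reports what differs between two environments. `environment_fingerprint` and `diff_fingerprints` are hypothetical helpers written for this example. Driver versions such as CUDA are not captured, because reading them needs vendor-specific tooling.

```python
import platform
from importlib import metadata


def environment_fingerprint(packages):
    """Collect the interpreter, OS, and library versions of this environment."""
    libraries = {}
    for name in packages:
        try:
            libraries[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            libraries[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "os": platform.platform(),
        "libraries": libraries,
    }


def diff_fingerprints(a, b):
    """Return the top-level keys whose values differ between two fingerprints."""
    return sorted(key for key in a.keys() | b.keys() if a.get(key) != b.get(key))


# The package names are only examples; missing packages are recorded as None.
training_env = environment_fingerprint(["numpy", "scikit-learn"])
serving_env = dict(training_env, os="some-other-os")  # simulated second environment
print(diff_fingerprints(training_env, serving_env))
```

Building one container image and running it everywhere makes such a diff empty by construction, which is the point of the slide.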
  12. Kubernetes
      - "Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications."
      - We can deploy both web services and batch jobs on Kubernetes
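
The sketch below shows how one container image could back both workloads: a `Deployment` for the web API and a `Job` for batch training. It builds the manifests as Python dictionaries and prints them as JSON, which Kubernetes accepts alongside YAML. The image reference, resource names, and the `serve` / `train` arguments are assumptions made for illustration.

```python
import json

IMAGE = "registry.example.com/ml/model:1.0.0"  # hypothetical image reference


def inference_deployment(image, replicas=2):
    """Manifest for a long-running web API (apps/v1 Deployment)."""
    labels = {"app": "model-api"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "model-api"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{"name": "api", "image": image, "args": ["serve"]}]
                },
            },
        },
    }


def training_job(image):
    """Manifest for a run-to-completion batch workload (batch/v1 Job)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "model-training"},
        "spec": {
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "train", "image": image, "args": ["train"]}],
                }
            },
        },
    }


for manifest in (inference_deployment(IMAGE), training_job(IMAGE)):
    print(json.dumps(manifest, indent=2))
```

Both manifests reference the same image, so training and inference share code, libraries, and OS. Only the entry point argument differs.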
  13. Kubeflow
      - "The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable."
      - It started as an open-source implementation of Google's internal ML platform (TFX)
      - Today, Kubeflow places no restrictions on libraries or cloud services
  14. Let's use Kubernetes / Kubeflow for ML platform!
      - Be careful about adopting Kubernetes or Kubeflow as an ML platform
      - Both Kubernetes and Kubeflow require a huge amount of effort
      - Several companies tried Kubeflow and then decided to use a managed ML platform
  15. Managed ML platform
      - Vertex AI: Build, deploy, and scale machine learning (ML) models faster, with fully managed ML tools for any use case.
      - SageMaker: Build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.
      - Both services originate from their vendors' internal ML platforms (Google and Amazon)
  16. TOC
      - Why Kubernetes for ML?
      - Decisions to use Kubernetes and Kubeflow or not <-
      - Requirements to use Kubernetes as an ML platform
      - Summary
  17. Decisions to use Kubernetes and Kubeflow or not
      - Ride managed ML platform (Vertex AI)
      - Tried Kubeflow but left
      - Experts of ML on Kubernetes
      - Container platform hopper (Challenger)
      - Hybrid: Vertex & Kubernetes
  18. Ride managed ML platform (Vertex AI): CADDi
      - Small team & fast delivery
      - Aiming for stable operation with OpenSearch-based image search and added tests (OpenSearchで実現する画像検索とテスト追加で目指す安定運用)
      - Managed MLOps at CADDi AI Lab https://speakerdeck.com/vaaaaanquish/caddi-ai-labniokerumanezidonamlops
  19. Ride managed ML platform (Vertex AI): CAM (CyberAgent Group)
      - Small team
      - Building an MLOps platform with Vertex AI https://speakerdeck.com/cyberagentdevelopers/vertexaidegou-zhu-sitamlopsji-pan-falsequ-rizu-mi
  20. Tried Kubeflow but left: Repro
      - Kubeflow is too painful to use
          - Cannot update Kubeflow in place (delete & create)
          - Fine-grained logging costs too much (with Prometheus)
          - Too expensive to keep watching Kubeflow & Kubernetes
      - Use Vertex AI to avoid managing Kubernetes & Kubeflow
  21. Tried Kubeflow but left: Mercari
      - Building an internal ML platform is too expensive
      - Hard to maintain the code base after a key engineer left the company
      - Decided to use Kubeflow, then moved to Vertex AI
  22. Tried Kubeflow but left: ZOZO
      - Hosting multi-tenant Kubeflow is too expensive
          - Tons of YAML files and customizations
          - Hard to scale within the team
      - Use Vertex AI to avoid hosting Kubeflow themselves
      - Insights and challenges from building an MLOps platform with Kubeflow https://techblog.zozo.com/entry/mlops-platform-kubeflow
  23. Experts of ML on Kubernetes: LINE
      - For historical and security reasons, they operate on-prem clusters at extreme scale
      - Excellence in managing bare-metal servers and Kubernetes
      - Lupus - A Monitoring System for Accelerating MLOps https://speakerdeck.com/line_devday2021/lupus-a-monitoring-system-for-accelerating-mlops
  24. Experts of ML on Kubernetes: Yahoo! Japan
      - For historical reasons, they operate on-prem clusters at extreme scale
      - Excellence in managing bare-metal servers and Kubernetes
      - Huge investment in Kubernetes
      - A Kubernetes Operator for continuous model monitoring https://www.slideshare.net/techblogyahoo/kubernetes-operator-251612755
  25. Experts of ML on Kubernetes: PFN
      - Power users of machine learning (ML researchers)
      - They need bare-metal servers to:
          1. use GPUs and CPUs as much as possible
          2. create their own chip (accelerator) and test it on their servers
      - A Kubernetes Operator for continuous model monitoring https://www.slideshare.net/techblogyahoo/kubernetes-operator-251612755
  26. Rakuten
      - For historical reasons, they operate on-prem clusters at extreme scale
      - Excellence in managing bare-metal servers and Kubernetes
      - A machine learning platform on Kubernetes: use cases at Rakuten, 覃子麟 (チンツーリン) / Rakuten, Inc. https://www.slideshare.net/rakutentech/kubernetes-144707493?from_action=save
      - Rakuten's scale and the role of the Cloud Platform Supervisory Department https://www.slideshare.net/rakutentech/ss-253221883
  27. Container platform hopper (Challenger): ABEJA
      - Docker Swarm -> Rancher -> Kubernetes (EKS)
      - Excellence in Kubernetes
      - Publishing ABEJA's tech stack (November 2019 edition) https://tech-blog.abeja.asia/entry/tech-stack-201911
      - Publishing the ABEJA Insight for Retail tech stack (October 2021 edition) https://tech-blog.abeja.asia/entry/retail-tech-stack-202110
  28. Hybrid: Vertex & Kubernetes: DeNA
      - Moved from serverless services to Vertex Pipelines (training) & Kubernetes (inference)
      - What do MLOps engineers at DeNA do? [DeNA TechCon 2021 Winter] https://speakerdeck.com/dena_tech/techcon2021-winter-5
  29. TOC
      - Why Kubernetes for ML?
      - Decisions to use Kubernetes and Kubeflow or not
      - Requirements to use Kubernetes as an ML platform <-
      - Summary
  30. Requirements to use Kubernetes as an ML platform
      - Using Kubernetes as a platform everywhere in the organization
      - Capability to customize Kubernetes & Kubeflow
      - A strong heart to bear the pain caused by breaking changes
  31. Summary
      - Using containers is good practice for ML to avoid training/serving skew
      - Be careful about adopting Kubernetes & Kubeflow as an ML platform
      - The minimum requirement to use Kubernetes as an ML platform is the capability to customize Kubernetes to fit your use cases
      - Consider a hybrid approach: a managed service for training and an inference service on Kubernetes