
Rayleigh: A Management Platform for Ray Clusters

The Machine Learning Platform department provides machine-learning-based solutions to a wide variety of services. In recent years, as models have become larger and more sophisticated, we have been working on distributed computing with Ray. As part of this work, we developed Rayleigh, the platform that supports Ray in production, achieving both faster development and a high level of security.
In this talk, we share our hands-on experience, covering why we chose Ray and what problems Rayleigh solves.

Transcript

1. Agenda: 1. Our Team 2. Background and Challenges 3. Ray and Rayleigh 4. Present 5. Future
2. Agenda: 1. Our Team 2. Background and Challenges 3. Ray and Rayleigh 4. Present 5. Future
3. Machine Learning Platform Department • Approx. 40 people total • Independent from service/dev depts • Provide ML solutions for multiple services (Machine Learning / Server-side / Infrastructure)
4. Data and Output Scale: ex-LINE ML solutions supported by Machine Learning Platform Department (diagram labels): Supports 10+, Analyze Logs of 100M+, Runs 100+, Services, Jobs Per Day, Global Users
5. Products from Machine Learning Platform: Machine Learning Platform, ML API, Liffy, OFS, … (Smart Channel, NEWS, STICKERS, MUSIC, AD). Online Feature Store (OFS): data store oriented to high-throughput / low-latency random access. Liffy: a unified A/B testing platform that integrates those previously operated by LINE and Yahoo Japan. ML API: AutoML-like platform.
6. Agenda: 1. Our Team 2. Background and Challenges 3. Ray and Rayleigh 4. Present 5. Future
7. ML System Overview (diagram labels): News, Stickers, Official Account, Reranker, Retriever, Feature Extraction, User Features, Item Features, ・・・, ANN Index, Features, Rating, App Log, User Log, ETL Preprocess, Services, Client App
8. Challenges in 2024. Challenges: Cold Start Problem, Data and Models Fragmented Across Services, Near Realtime Recommendation, Limited Input Information
9. Challenges in 2024. Challenges: Cold Start Problem, Data and Models Fragmented Across Services, Near Realtime Recommendation, Limited Input Information. Solutions: Multi-Modal / Multi-Domain / Robust Foundation Model
10. Challenges in 2024. Challenges: Cold Start Problem, Data and Models Fragmented Across Services, Near Realtime Recommendation, Limited Input Information. Solutions: Multi-Modal / Multi-Domain / Robust Foundation Model. Requirements: Medium / Large Models (MLP, .. + BERT, GPTX, ..), State-of-the-art Technology (Distributed Training, Memory Optimization)
11. ML System Overview (diagram labels): Rating, App Log, Services, Client App, User Log, News, Stickers, User Features, Item Features, ・・・, ANN Index, Official Account, Reranker, Retriever, Feature Extraction, Features, ETL Preprocess
12. Distributed Computing Libraries: Ghee ※1 • In-house Python library • Performs distributed computing • May have limited compatibility with some third-party libraries. It's still going strong! ※1. https://linedevday.linecorp.com/2020/en/sessions/9750/
13. Distributed Computing Libraries: Ray • An open-source framework for easy distributed computing • Powers fast, scalable ML and data processing • Simple code, efficient execution • Works well with third-party libraries (e.g. DeepSpeed). Let's give it a try!
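
Ray's appeal here is that ordinary Python functions become distributed tasks with minimal changes. As a rough illustration (standard open-source Ray API; the function and data are made up for this sketch):

```python
import ray

ray.init()  # start a local Ray runtime, or connect to a configured cluster

@ray.remote
def extract_features(shard):
    # Placeholder for per-shard preprocessing / feature extraction.
    return len(shard)

# Fan the work out across the cluster and collect the results.
futures = [extract_features.remote(s) for s in ([1, 2], [3, 4, 5], [6])]
print(ray.get(futures))  # -> [2, 3, 1]
```
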
14. ML System Overview (diagram labels): Rating, App Log, Services, Client App, User Log, News, Stickers, User Features, Item Features, ・・・, ANN Index, Official Account, Reranker, Retriever, Feature Extraction, Features, ETL Preprocess
15. ML System Overview (diagram labels): Rating, App Log, Services, Client App, User Log, News, Stickers, User Features, Item Features, ・・・, ANN Index, Official Account, Reranker, Retriever, Feature Extraction, Features, ETL Preprocess
16. Agenda: 1. Our Team 2. Background and Challenges 3. Ray and Rayleigh 4. Present 5. Future
17. Early Trials with Ray (diagram labels): Verda/IU Kubernetes, Company-wide Kubernetes, Project Namespace, Ray Cluster (Head, Worker, Worker); ⓪ Set up kubectl ① Port-forward to the head node ② Submit jobs; Access to Ray Dashboard
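
In this early-trial setup, each engineer reached the head node through a kubectl port-forward and then talked to the Ray Dashboard port directly. A minimal sketch of step ② using Ray's job-submission SDK (the service name, namespace, and entrypoint below are placeholders):

```python
# ⓪/①: make the Ray dashboard reachable locally, e.g.
#   kubectl -n <project-namespace> port-forward svc/<ray-head-svc> 8265:8265
from ray.job_submission import JobSubmissionClient

# ②: submit a job through the forwarded dashboard port
client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    entrypoint="python train.py",        # illustrative entrypoint
    runtime_env={"working_dir": "./"},   # ship the local project to the cluster
)
print(client.get_job_status(job_id))
```
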
18. Early Trials with Ray (diagram labels): Verda/IU Kubernetes, Company-wide Kubernetes, Project Namespace, Ray Cluster (Head, Worker, Worker), HDFS, Storage Access, Dev Speed
19. Early Trials with Ray (diagram labels): Verda/IU Kubernetes, Company-wide Kubernetes, Project Namespace, Ray Cluster (Head, Worker, Worker), HDFS, Log Storage, Storage Access, Log Persistence, Dev Speed
20. Early Trials with Ray (diagram labels): Verda/IU Kubernetes, Company-wide Kubernetes, Project Namespace, Ray Cluster (Head, Worker, Worker), HDFS, Log Storage, Storage Access, Log Persistence, Observability, Dev Speed
21. Early Trials with Ray (diagram labels): Verda/IU Kubernetes, Company-wide Kubernetes, Project Namespace, two Ray Clusters, each with Personal Secrets
22. Early Trials with Ray (diagram labels): Verda/IU Kubernetes, Company-wide Kubernetes, Project Namespace, two Ray Clusters, each with Personal Secrets
23. Early Trials with Ray (diagram labels): Verda/IU Kubernetes, Company-wide Kubernetes, Project Namespace, two Ray Clusters, each with Personal Secrets, Resource Quota
24. Rayleigh: A Management Platform for Ray Clusters (diagram labels): Ray Cluster, Rayleigh Proxy, Rayleigh Server, Create, Delete, Submit a job, Access to Dashboard
25. Rayleigh System Architecture (diagram labels): CLI / SDK, RayCluster CR, Server, xDS Server, Proxy, Rayleigh, Create / Delete, Network Policy, Head, Worker, Worker, Ray Cluster, KubeRay, Target Namespace
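
Concretely, the Rayleigh server materializes each cluster as a KubeRay RayCluster custom resource in the target namespace (together with a NetworkPolicy, not shown here). A rough sketch of what creating such a resource looks like with the Kubernetes Python client; the namespace, image, and sizes are illustrative, and Rayleigh's actual manifests may differ:

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

ray_cluster = {
    "apiVersion": "ray.io/v1",
    "kind": "RayCluster",
    "metadata": {"name": "demo-cluster"},
    "spec": {
        "headGroupSpec": {
            "rayStartParams": {},
            "template": {"spec": {"containers": [
                {"name": "ray-head", "image": "rayproject/ray:2.9.0"},
            ]}},
        },
        "workerGroupSpecs": [{
            "groupName": "workers",
            "replicas": 2,
            "rayStartParams": {},
            "template": {"spec": {"containers": [
                {"name": "ray-worker", "image": "rayproject/ray:2.9.0"},
            ]}},
        }],
    },
}

# KubeRay reconciles this CR into head and worker pods in the target namespace.
api.create_namespaced_custom_object(
    group="ray.io", version="v1", namespace="target-namespace",
    plural="rayclusters", body=ray_cluster,
)
```
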
26. Rayleigh System Architecture (diagram labels): CLI / SDK, Rayleigh, Server, xDS Server, Proxy, Authz / Routing, Dashboard, Submit jobs, RayCluster CR, Network Policy, Ray Cluster (Head, Worker, Worker)
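
From a user's point of view, the proxy turns dashboard access and job submission into a single authenticated endpoint instead of a per-cluster port-forward. A purely hypothetical sketch; the proxy URL and the authentication header are assumptions, not Rayleigh's actual interface:

```python
from ray.job_submission import JobSubmissionClient

# Hypothetical: the Rayleigh proxy authenticates the request (Authz) and
# routes it to the right Ray cluster; URL and token are illustrative only.
client = JobSubmissionClient(
    "https://rayleigh-proxy.example.internal/clusters/demo-cluster",
    headers={"Authorization": "Bearer <token>"},
)
client.submit_job(entrypoint="python train.py")
```
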
27. Deploy Ray Clusters Anywhere on Verda (diagram labels): ex-LINE Corporation, ex-Yahoo Japan Corporation, LY Corporation, on IU, on ACP, on Flava, Currently Supported, Kubernetes Services on In-House Private Cloud
28. Support the whole dev process (diagram labels): Production, Scheduled Job, Job Submitter, Ray Cluster, RayJob CR, KubeRay, Develop, RayCluster CR, Research, Ray Cluster, KubeRay
29. Rayleigh System Architecture, Support RayJob (diagram labels): CLI / SDK, Server, RayJob CR, Submitter, Ray Cluster (Head, Worker, Worker)
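
For production runs, KubeRay's RayJob resource spins up a cluster, runs a submitter that executes the entrypoint, and can tear everything down when the job finishes. A rough sketch of such a manifest expressed as a Python dict (values are illustrative; Rayleigh generates its own):

```python
ray_job = {
    "apiVersion": "ray.io/v1",
    "kind": "RayJob",
    "metadata": {"name": "nightly-training"},
    "spec": {
        "entrypoint": "python train.py",   # what the submitter runs against the cluster
        "shutdownAfterJobFinishes": True,  # tear the cluster down once the job completes
        "rayClusterSpec": {
            # Same shape as a RayCluster spec (head group, worker groups, images, ...).
        },
    },
}
```
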
30. Rayleigh System Architecture, Application Log & Ray System Log Persistence (diagram labels): CLI / SDK, Server, RayJob CR, Submitter, Ray Cluster (Head, Worker, Worker), Sidecar, Sidecar, Sidecar, Sidecar, Log storage
31. Agenda: 1. Our Team 2. Background and Challenges 3. Ray and Rayleigh 4. Present 5. Future
32. Present: Rayleigh is still in the trial phase. Ray + Rayleigh environment, “Vector Forge” Project, Machine Learning Engineers, Rayleigh Admins, 20+ Users, 10+ Ray Clusters
33. Vector Forge Project • A project to develop a common robust model for generating features for recommendation systems. Diagram labels: Item, User, Log, Feature, Ray + DeepSpeed, Model. https://tech-verse.lycorp.co.jp/2025/ja/session/1128/
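
Vector Forge trains this model with Ray plus DeepSpeed. As a minimal sketch of what Ray-driven distributed training looks like, here is the Ray Train pattern; the loop body, worker count, and DeepSpeed wiring are placeholders, not the project's actual code:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # On each worker: build the model, wrap it with DeepSpeed
    # (e.g. deepspeed.initialize) and iterate over this worker's data shard.
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),  # illustrative sizing
)
result = trainer.fit()
```
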
34. Voice of the Customer: things that are great about using Rayleigh • MLEs can easily set up a distributed training environment in no time. This is a significant advantage. • MLEs can train models in the same environment, ensuring reproducibility from an infrastructure standpoint.
35. Agenda: 1. Our Team 2. Background and Challenges 3. Ray and Rayleigh 4. Present 5. Future
36. Become a Part of the Building Blocks (diagram layers): Computing (Private Cloud): Verda (IU, ACP, Flava), Laketahoe; KubeRay / Rayleigh; Distributed Processing Engine: Ray, Ray on Laketahoe, Ghee; ML Products: ML API, Vector Forge, Yamla, ・・・; Apps for Service: ML for LINE NEWS, ML for LINE STICKERS, ・・・
37. Ideas for Refactoring (diagram labels): v1: CLI, Server (REST), xDS Server (gRPC), proxy, RDB, Ray Cluster, Ray Cluster; v2: CLI, Server (gRPC), proxy, RDB, Configs, Ray Cluster, Ray Cluster
38. Conclusion - Our Machine Learning Platform Department provides ML solutions for a wide range of services and is continuously working to improve them. - In this project, we adopted Ray to efficiently experiment with multi-model setups and relatively large models. - To enhance both developer productivity and security when using Ray in our environment, we developed Rayleigh, which in turn contributed to projects like Vector Forge.