
Rayleigh: A Management Platform for Ray Clusters

The Machine Learning Platform department provides machine-learning-based solutions to a wide variety of services. In recent years, as models have become larger and more sophisticated, we have been working on distributed computing with Ray. As part of this work, we developed Rayleigh, the platform that supports Ray in production, achieving both faster development and a high level of security.
In this talk, we share our hands-on experience, covering why we chose Ray and what problems Rayleigh solves.

Transcript

1. Agenda: 1. Our Team 2. Background and Challenges 3. Ray and Rayleigh 4. Present 5. Future
2. Agenda: 1. Our Team 2. Background and Challenges 3. Ray and Rayleigh 4. Present 5. Future
3. Machine Learning Platform Department • Approx. 40 people total • Independent from service/dev depts • Provide ML solutions for multiple services (Machine Learning / Server-side / Infrastructure)
4. Data and Output Scale: ex-LINE ML solutions supported by Machine Learning Platform Department (diagram labels): Supports 10+, Analyze Logs of 100M+, Runs 100+, Services, Jobs Per Day, Global Users
5. Products from Machine Learning Platform: Machine Learning Platform, ML API, Liffy, OFS, … (Smart Channel, NEWS, STICKERS, MUSIC, AD). Online Feature Store (OFS): data store oriented to high-throughput / low-latency random access. Liffy: a unified A/B testing platform that integrates those previously operated by LINE and Yahoo Japan. ML API: AutoML-like platform.
6. Agenda: 1. Our Team 2. Background and Challenges 3. Ray and Rayleigh 4. Present 5. Future
7. ML System Overview (diagram labels): News, Stickers, Official Account, Reranker, Retriever, Feature Extraction, User Features, Item Features, ・・・, ANN Index, Features, Rating, App Log, User Log, ETL Preprocess, Services, Client App
8. Challenges in 2024. Challenges: Cold Start Problem, Data and Models Fragmented Across Services, Near Realtime Recommendation, Limited Input Information
9. Challenges in 2024. Challenges: Cold Start Problem, Data and Models Fragmented Across Services, Near Realtime Recommendation, Limited Input Information. Solutions: Multi-Modal / Multi-Domain / Robust Foundation Model
10. Challenges in 2024. Challenges: Cold Start Problem, Data and Models Fragmented Across Services, Near Realtime Recommendation, Limited Input Information. Solutions: Multi-Modal / Multi-Domain / Robust Foundation Model. Requirements: Medium / Large Models (MLP, .. + BERT, GPTX, ..), State-of-the-art Technology (Distributed Training, Memory Optimization)
11. ML System Overview (diagram labels): Rating, App Log, Services, Client App, User Log, News, Stickers, User Features, Item Features, ・・・, ANN Index, Official Account, Reranker, Retriever, Feature Extraction, Features, ETL Preprocess
12. Distributed Computing Libraries: Ghee ※1 • In-house Python library • Performs distributed computing • May have limited compatibility with some third-party libraries. It's still going strong! ※1. https://linedevday.linecorp.com/2020/en/sessions/9750/
13. Distributed Computing Libraries: Ray • An open-source framework for easy distributed computing • Powers fast, scalable ML and data processing • Simple code, efficient execution • Works well with third-party libraries (e.g. DeepSpeed). Let's give it a try!
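
Ray's appeal here is that ordinary Python functions become distributed tasks with minimal changes. As a rough illustration (standard open-source Ray API; the function and data are made up for this sketch):

```python
import ray

ray.init()  # start a local Ray runtime, or connect to a configured cluster

@ray.remote
def extract_features(shard):
    # Placeholder for per-shard preprocessing / feature extraction.
    return len(shard)

# Fan the work out across the cluster and collect the results.
futures = [extract_features.remote(s) for s in ([1, 2], [3, 4, 5], [6])]
print(ray.get(futures))  # -> [2, 3, 1]
```
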
14. ML System Overview (diagram labels): Rating, App Log, Services, Client App, User Log, News, Stickers, User Features, Item Features, ・・・, ANN Index, Official Account, Reranker, Retriever, Feature Extraction, Features, ETL Preprocess
15. ML System Overview (diagram labels): Rating, App Log, Services, Client App, User Log, News, Stickers, User Features, Item Features, ・・・, ANN Index, Official Account, Reranker, Retriever, Feature Extraction, Features, ETL Preprocess
16. Agenda: 1. Our Team 2. Background and Challenges 3. Ray and Rayleigh 4. Present 5. Future
17. Early Trials with Ray (diagram labels): Verda/IU Kubernetes, Company-wide Kubernetes, Project Namespace, Ray Cluster (Head, Worker, Worker); ⓪ Set up kubectl ① Port-forward to the head node ② Submit jobs; Access to Ray Dashboard
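
In this early-trial setup, each engineer reached the head node through a kubectl port-forward and then talked to the Ray Dashboard port directly. A minimal sketch of step ② using Ray's job-submission SDK (the service name, namespace, and entrypoint below are placeholders):

```python
# ⓪/①: make the Ray dashboard reachable locally, e.g.
#   kubectl -n <project-namespace> port-forward svc/<ray-head-svc> 8265:8265
from ray.job_submission import JobSubmissionClient

# ②: submit a job through the forwarded dashboard port
client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    entrypoint="python train.py",        # illustrative entrypoint
    runtime_env={"working_dir": "./"},   # ship the local project to the cluster
)
print(client.get_job_status(job_id))
```
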
18. Early Trials with Ray (diagram labels): Verda/IU Kubernetes, Company-wide Kubernetes, Project Namespace, Ray Cluster (Head, Worker, Worker), HDFS, Storage Access, Dev Speed
19. Early Trials with Ray (diagram labels): Verda/IU Kubernetes, Company-wide Kubernetes, Project Namespace, Ray Cluster (Head, Worker, Worker), HDFS, Log Storage, Storage Access, Log Persistence, Dev Speed
20. Early Trials with Ray (diagram labels): Verda/IU Kubernetes, Company-wide Kubernetes, Project Namespace, Ray Cluster (Head, Worker, Worker), HDFS, Log Storage, Storage Access, Log Persistence, Observability, Dev Speed
21. Early Trials with Ray (diagram labels): Verda/IU Kubernetes, Company-wide Kubernetes, Project Namespace, two Ray Clusters, each with Personal Secrets
22. Early Trials with Ray (diagram labels): Verda/IU Kubernetes, Company-wide Kubernetes, Project Namespace, two Ray Clusters, each with Personal Secrets
23. Early Trials with Ray (diagram labels): Verda/IU Kubernetes, Company-wide Kubernetes, Project Namespace, two Ray Clusters, each with Personal Secrets, Resource Quota
24. Rayleigh: A Management Platform for Ray Clusters (diagram labels): Ray Cluster, Rayleigh Proxy, Rayleigh Server, Create, Delete, Submit a job, Access to Dashboard
25. Rayleigh System Architecture (diagram labels): CLI / SDK, RayCluster CR, Server, xDS Server, Proxy, Rayleigh, Create / Delete, Network Policy, Head, Worker, Worker, Ray Cluster, KubeRay, Target Namespace
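
Concretely, the Rayleigh server materializes each cluster as a KubeRay RayCluster custom resource in the target namespace (together with a NetworkPolicy, not shown here). A rough sketch of what creating such a resource looks like with the Kubernetes Python client; the namespace, image, and sizes are illustrative, and Rayleigh's actual manifests may differ:

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

ray_cluster = {
    "apiVersion": "ray.io/v1",
    "kind": "RayCluster",
    "metadata": {"name": "demo-cluster"},
    "spec": {
        "headGroupSpec": {
            "rayStartParams": {},
            "template": {"spec": {"containers": [
                {"name": "ray-head", "image": "rayproject/ray:2.9.0"},
            ]}},
        },
        "workerGroupSpecs": [{
            "groupName": "workers",
            "replicas": 2,
            "rayStartParams": {},
            "template": {"spec": {"containers": [
                {"name": "ray-worker", "image": "rayproject/ray:2.9.0"},
            ]}},
        }],
    },
}

# KubeRay reconciles this CR into head and worker pods in the target namespace.
api.create_namespaced_custom_object(
    group="ray.io", version="v1", namespace="target-namespace",
    plural="rayclusters", body=ray_cluster,
)
```
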
26. Rayleigh System Architecture (diagram labels): CLI / SDK, Rayleigh, Server, xDS Server, Proxy, Authz / Routing, Dashboard, Submit jobs, RayCluster CR, Network Policy, Ray Cluster (Head, Worker, Worker)
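
From a user's point of view, the proxy turns dashboard access and job submission into a single authenticated endpoint instead of a per-cluster port-forward. A purely hypothetical sketch; the proxy URL and the authentication header are assumptions, not Rayleigh's actual interface:

```python
from ray.job_submission import JobSubmissionClient

# Hypothetical: the Rayleigh proxy authenticates the request (Authz) and
# routes it to the right Ray cluster; URL and token are illustrative only.
client = JobSubmissionClient(
    "https://rayleigh-proxy.example.internal/clusters/demo-cluster",
    headers={"Authorization": "Bearer <token>"},
)
client.submit_job(entrypoint="python train.py")
```
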
27. Deploy Ray Clusters Anywhere on Verda (diagram labels): ex-LINE Corporation, ex-Yahoo Japan Corporation, LY Corporation, on IU, on ACP, on Flava, Currently Supported, Kubernetes Services on In-House Private Cloud
28. Support the whole dev process (diagram labels): Production, Scheduled Job, Job Submitter, Ray Cluster, RayJob CR, KubeRay, Develop, RayCluster CR, Research, Ray Cluster, KubeRay
29. Rayleigh System Architecture, Support RayJob (diagram labels): CLI / SDK, Server, RayJob CR, Submitter, Ray Cluster (Head, Worker, Worker)
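
For production runs, KubeRay's RayJob resource spins up a cluster, runs a submitter that executes the entrypoint, and can tear everything down when the job finishes. A rough sketch of such a manifest expressed as a Python dict (values are illustrative; Rayleigh generates its own):

```python
ray_job = {
    "apiVersion": "ray.io/v1",
    "kind": "RayJob",
    "metadata": {"name": "nightly-training"},
    "spec": {
        "entrypoint": "python train.py",   # what the submitter runs against the cluster
        "shutdownAfterJobFinishes": True,  # tear the cluster down once the job completes
        "rayClusterSpec": {
            # Same shape as a RayCluster spec (head group, worker groups, images, ...).
        },
    },
}
```
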
30. Rayleigh System Architecture, Application Log & Ray System Log Persistence (diagram labels): CLI / SDK, Server, RayJob CR, Submitter, Ray Cluster (Head, Worker, Worker), Sidecar, Sidecar, Sidecar, Sidecar, Log storage
31. Agenda: 1. Our Team 2. Background and Challenges 3. Ray and Rayleigh 4. Present 5. Future
32. Present: Rayleigh is still in the trial phase. Ray + Rayleigh environment, “Vector Forge” Project, Machine Learning Engineers, Rayleigh Admins, 20+ Users, 10+ Ray Clusters
33. Vector Forge Project • A project to develop a common robust model for generating features for recommendation systems. Diagram labels: Item, User, Log, Feature, Ray + DeepSpeed, Model. https://tech-verse.lycorp.co.jp/2025/ja/session/1128/
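
Vector Forge trains this model with Ray plus DeepSpeed. As a minimal sketch of what Ray-driven distributed training looks like, here is the Ray Train pattern; the loop body, worker count, and DeepSpeed wiring are placeholders, not the project's actual code:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # On each worker: build the model, wrap it with DeepSpeed
    # (e.g. deepspeed.initialize) and iterate over this worker's data shard.
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),  # illustrative sizing
)
result = trainer.fit()
```
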
34. Voice of the Customer: things that are great about using Rayleigh • MLEs can easily set up a distributed training environment in no time. This is a significant advantage. • MLEs can train models in the same environment, ensuring reproducibility from an infrastructure standpoint.
35. Agenda: 1. Our Team 2. Background and Challenges 3. Ray and Rayleigh 4. Present 5. Future
36. Become a Part of the Building Blocks (diagram layers): Computing (Private Cloud): Verda (IU, ACP, Flava), Laketahoe; KubeRay / Rayleigh; Distributed Processing Engine: Ray, Ray on Laketahoe, Ghee; ML Products: ML API, Vector Forge, Yamla, ・・・; Apps for Service: ML for LINE NEWS, ML for LINE STICKERS, ・・・
37. Ideas for Refactoring (diagram labels): v1: CLI, Server (REST), xDS Server (gRPC), proxy, RDB, Ray Cluster, Ray Cluster; v2: CLI, Server (gRPC), proxy, RDB, Configs, Ray Cluster, Ray Cluster
38. Conclusion - Our Machine Learning Platform Department provides ML solutions for a wide range of services and is continuously working to improve them. - In this project, we adopted Ray to efficiently experiment with multi-model setups and relatively large models. - To enhance both developer productivity and security when using Ray in our environment, we developed Rayleigh, which in turn contributed to projects like Vector Forge.