Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster / KubeCon + CloudNativeCon North America 2024

Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster Yuichiro Ueno,
Toru Komatsu (Preferred Networks, Inc.) KubeCon North America 2024 1

Speakers @utam0k @y1r96 Toru Komatsu Preferred Networks, Inc. 2 Yuichiro
Ueno Preferred Networks, Inc.

1. Background: AI / ML Workloads ✓ Storage Requirements and
Kubernetes Usage 2. Our system: Simple Cache Service 3. Use case 4. Deploy Considerations ✓ How to optimize network trafﬁc and key distribution to achieve higher performance ✓ The number of SCS in production 5. Summary Today's topic 3

Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster Yuichiro Ueno,
Toru Komatsu (Preferred Networks, Inc.) KubeCon North America 2024 4

Training of Machine Learning Models 5 Compute Node Deep Neural
Network Compute Node Deep Neural Network Data Samples Dataset Icon pack by Icons8 - https://icons8.com

• Network File System (NFS) with hostPath ◦ Fast but
not scalable • Object Storage ◦ Scalable but not fast ▪ We're using HDDs as backend of our object storage • Node Local Storage (NVMe) ◦ Very fast but the storage is not globally available, and not scalable ▪ If the workload is moved to different compute node, the data is unreachable. On-premise Storages for the dataset loading 12 Compute Node A Deep Neural Network Compute Node B Deep Neural Network Preempt and Re-scheduling Moved to node B unreachable Icon pack by Icons8 - https://icons8.com Cache

Best hierarchical storage for AI/ML workload ? 13 Object Storage
Compute Node Deep Neural Network Compute Node Deep Neural Network Cloud of Node Local Storage Capacity-optimized Performance-optimized Icon pack by Icons8 - https://icons8.com

Best hierarchical storage development with: 14 ✓ Topology-Aware Routing ✓
Informer for Pod Discovery ✓ Token Review API ✓ Consistent Hashing ✓ xDS API Cloud of Node Local Storage Icon pack by Icons8 - https://icons8.com

Overview of Simple Cache Service 15

Simple ✓ Simple HTTP REST API（GET & PUT） ✓ It
just returns local ﬁles Cloud Native ✓ SCS runs on Kubernetes Shared-nothing architecture ✓ Scalable Position as a Cache ✓ It’s just “Cache” and not “Persistent Storage” Overview of Simple Cache Service 16

How to use SCS # Upload àpple.jpg` and save as
àpple` object in `prj-foobar` bucket. $ curl -H "Authorization: Bearer $(cat /token)" \ -X PUT \ http://cache.cache-service.svc/v1/objects/prj-foobar/apple \ --data-binary @apple.jpg # Download àpple` object in `prj-foobar` bucket $ curl -H "Authorization: Bearer $(cat /token)" \ -X GET \ http://cache.cache-service.svc/v1/objects/prj-foobar/apple 17

`apple` object in `prj-foobar` bucket. $ curl -H "Authorization: Bearer $(cat /token)" \ -X PUT \ http://cache.cache-service.svc/v1/objects/prj-foobar/apple \ --data-binary @apple.jpg # Download `apple` object in `prj-foobar` bucket $ curl -H "Authorization: Bearer $(cat /token)" \ -X GET \ http://cache.cache-service.svc/v1/objects/prj-foobar/apple bucket object 18

`apple` object in `prj-foobar` bucket. $ curl -H "Authorization: Bearer $(cat /token)" \ -X PUT \ http://cache.cache-service.svc/v1/objects/prj-foobar/apple \ --data-binary @apple.jpg # Download `apple` object in `prj-foobar` bucket $ curl -H "Authorization: Bearer $(cat /token)" \ -X GET \ http://cache.cache-service.svc/v1/objects/prj-foobar/apple Bound Service Account Token 19

Overall Architecture (1/2) Load Balancing in Layer 4 Service with
Topology Aware Hints Load Balancing in Layer 7 Envoy Proxy with Consistent Hashing User Pods Cache 20

Overall Architecture (2/2) .sqlite Application Cache Data Meta Data 21
Icon pack by Icons8 - https://icons8.com

Shared-nothing Architecture Network Zone A Load Balancing in Layer 4
Service with Topology Aware Hints Load Balancing in Layer 7 Envoy Proxy with Consistent Hashing Network Zone B GET /objects/A GET /objects/A GET /objects/B Cache B Cache A User 22

1. Mount the Bound SA Token 2. Make the request
w/ the token in Auth Header Authorization (1/2) User Pods 3. Verify the token by TokenReview API ✓ “Aud as expected?” “Valid until?” “Pod is still alive?” ✓ Resolve the NS of the source from the SA username ➡ Namespace-level authorization can be implemented 23

Authorization (2/2) "Bucket": [ { "Name": "public", "Public": true, "BucketQuota":
"100Gi" }, { "Name": "kubecon", "Public": false, "BucketQuota": "500Gi", "AllowNamespaces" : [ "prj-kubernetes", "user-utam0k", ] } ] Public Bucket Private Bucket Based on Namespace Selector 24

Unfortunately, storage is a limited resource… 😭 LRU(Least Recently Used)
Strategy Total Limit When each bucket reaches its capacity limit, object deletion begins based on LRU 25

Use case 26

Case 1 SCS as a Cache for Slower Storage ✓
Make faster AI/ML Workloads ! Case 2 SCS as a Backend for Yet Another Cache ✓ Make faster startup of AI/ML Workloads ! Use case of SCS in AI / ML Workloads 27

Make faster AI/ML Workloads ! Case 2 SCS as a Backend for Yet Another Cache ✓ Make faster startup of AI/ML Workloads ! Use case of SCS in AI / ML Workloads 28 →

PFIO is an I/O abstraction library developed by us •
It can read / write / list Local ﬁlesystem, S3 compatible object storage, HDFS, … Read File in Object Storage with PFIO GET /000.jpg 200 OK 29 Object Storage Icon pack by Icons8 - https://icons8.com train_with_scs.py import pfio import torch fs = pfio.v2.from_url(zip_url) # fs is Local filesystem like object, actually S3 file_url = "000.jpg" with fs.open(file_url) as fp: image = fp.read() # image = torch.Tensor(image)... # loss = model(image)

PFIO supports transparent cache mechanism • It automatically checks data
in SCS ﬁrst, then try origin later if data is not exist • At ﬁrst, the desired data is not stored in SCS, therefore PFIO will put it to SCS Transparent Object Storage Cache by PFIO GET /000.jpg 404 Not Found PUT /000.jpg 201 Created Object Storage GET /000.jpg 200 OK 30 train_with_scs.py import pfio import torch fs = pfio.v2.from_url(zip_url, http_cache=scs_url) # fs is Local filesystem like object, actually S3 file_url = "000.jpg" with fs.open(file_url) as fp: image = fp.read() # image = torch.Tensor(image)... # loss = model(image) Icon pack by Icons8 - https://icons8.com

PFIO supports transparent cache mechanism • It automatically checks data
in SCS ﬁrst, then try origin later if data is not exist • If the desired data is stored in SCS, we can skip accessing Object Storage Transparent Object Storage Cache by PFIO GET /000.jpg 200 OK PUT /000.jpg 201 Created GET /000.jpg 200 OK 31 train_with_scs.py import pfio import torch fs = pfio.v2.from_url(zip_url, http_cache=scs_url) # fs is Local filesystem like object, actually S3 file_url = "000.jpg" with fs.open(file_url) as fp: image = fp.read() # image = torch.Tensor(image)... # loss = model(image) Object Storage SKIP ! Icon pack by Icons8 - https://icons8.com

Make faster AI/ML Workloads ! Case 2 SCS as a Backend for Yet Another Cache ✓ Make faster startup of AI/ML Workloads ! Use case of SCS in AI / ML Workloads 32 →

Type 1 Container Images It includes a lot of dependencies
◦ Compilers, CUDA (runtime and library), MPI, and PyTorch As a result, our all-in-one container image is 30+ GB Weekly cache hit rate to SCS is 94.3% in our cluster Type 2 Models Large Language Model is larger and larger ! ◦ GB ~ TB size Our researchers want to evaluate the performance of public LLMs Characteristics Ephemeral, Large, and Hot Many users access the same ﬁle Cache mechanism works well Other large ﬁles in AI / ML Workloads 33

Implementing Yet Another Cache using SCS Yet Another Cache Features
to implement ◦ URL Mappings ◦ from origin key to SCS bucket/key ◦ AuthN/AuthZ if needed ◦ Other necessary features Features not to implement ✓ Storage management ◦ Cache Eviction ◦ Capacity Control GET /000.jpg 404 Not Found PUT /000.jpg 201 Created Origin Service GET /000.jpg 200 OK GET /000.jpg 200 OK 34 ① ② ③ ⑤ ④ e.g. Container Image Layer

Deploying SCS 35

Q1 How can we optimize the network traffic ? Deploy
Considerations User Pods Q2 How can we configure Envoy to route the traffic ? 36

Company: Preferred Networks • Provides ML models like LLMs, and
solutions for industries • Uses own on-premise infrastructure to provide solutions Infrastructure • 3+ Kubernetes Clusters • 400+ Kubernetes Nodes • 30000+ CPU Cores • 320+ TiB Memory • 2000+ GPUs • Our AI Accelerator: MN-Core™ ◦ HW: RTL, Board/Server Design ◦ SW: Driver, Device Plugin, Compiler Background: Our computing infrastructure 38 Our Infrastructure

Network Zone D Network Zone C Network Zone B Network
Zone A Network topology of our data center: CLOS network Background: Data Center Network Spine Switch Spine Switch Leaf Switch Leaf Switch Leaf Switch Leaf Switch Node Node Node Node Node Node Node Node Node Node Node Node External / Super Spine Inter-zone Networking (oversubscribed) In-zone Networking 39

• Assumptions ◦ SCS is deployed to all nodes to
use local NVMe drives ◦ Also, User Pods will be scheduled to all nodes to use all accelerators • Where to deploy Envoy? ◦ We deploy Envoy to all nodes to reduce inter-zone trafﬁc of Pod/Envoy. ◦ Inter-zone trafﬁc of Envoy/SCS is unavoidable in that case. Where to deploy Envoy? Spine Switch Leaf Switch Leaf Switch Node Node Node Node Node Node Spine Switch 43

Reducing inter-zone traffic by K8s Service Topology Aware Routing Node
Node Node Node Node Node • Pod/Envoy Traffic ◦ Perfect / No network traffic • Envoy load balance ◦ Bad / No distribution of traffic ◦ When some node use SCS heavily, the Envoy's cpu load become high • Pod/Envoy Traffic ◦ Moderate / In-zone network traffic only • Envoy load balance ◦ Moderate / Distribute traffic in a zone ◦ When some node use SCS heavily, Envoy's cpu load is distributed among zone We use Topology Aware Routing to improve Envoy's cpu load balance 49 Internal Traffic Policy

• We want to route the trafﬁc from Envoy to
SCS consistently ◦ When we put an object to the N-th SCS, we want to get it from the N-th SCS. • The easiest way to achieve that: ◦ Manage a mapping from bucket/object to id of backend ◦ Mapping should be sharded… ▪ Introduce a distributed MDS ▪ Too complicated solution for us • We don't manage a mapping explicitly ◦ Use hash(bucket + "/" + object) to choose a backend server Load Balancing of Keys (Bucket and Object) 53

• Use (hash % number-of-backends) as backend id ? ◦
When the number of backends changes, almost every keys remaps ▪ Typical example: Node Failure / Installation ◦ More sophisticated way -> Consistent Hashing ▪ Bound of remapped keys is keys/backends • Two Consistent Hashing algorithms in Envoy/lb_policy Consistent Hashing 55 Backend 1 Backend 3 Backend 2 Backend 4 Hash (10) Backend 1 • {1, 6, 10} Backend 2 • {2, 5, 11} Backend 3 • {7, 8, 9} Backend 4 • {3, 4, 12} Hash (10) RING_HASH MAGLEV

• Load balance of keys is also very important ◦
The length of arc in RING corresponds to the ratio of responsibilities ▪ Backend 3 is 1.5x responsibility of Backend 4 ◦ It affect the performance ! ▪ B3's cpu usage is 1.5x of B4's • Because B3 is 1.5x busier than B4 • May result the longer latency ▪ The lifetime of B3 data is 1.5x shorter than B4 data • Because the cache capacity is the same • More possibility of deletion We want to see the consistent resource usage and lifetime Key distribution matters! 56 Backend 1 Backend 3 Backend 2 Backend 4 RING_HASH

RING_HASH vs MAGLEV -> We use MAGLEV 58 RING_HASH MAGLEV
Objects Count per node Backend 1 Backend 3 Backend 2 Backend 4 Hash (10) Backend 1 • {1, 6, 10} Backend 2 • {2, 5, 11} Backend 3 • {7, 8, 9} Backend 4 • {3, 4, 12} Hash (10) RING_HASH MAGLEV Load imbalance up to 1.5x No load imbalance

The number of SCS in production 59

API calls / sec 60 Peak: 37k requests / sec

Numbers of SCS: Aggregated Trafﬁc 61 Peak: 75.1 GiB/s throughput

• Peak performance of the last 30 days ◦ 37k
requests / sec ◦ 75.1 GiB/s throughput • We achieved this performance in the production environment with 55 Backend Servers with 82.5 TB NVMe Storage in total • Usage ◦ 268M Objects ◦ Response code statistics: ▪ 200 OK (GET): 96.2 % ▪ 404 Not Found (GET): 0.9 % ▪ 201 Created (PUT): 2.9 % Numbers of SCS 62

Summary 63

SCS Summary Features Feature 1 Shared-nothing: Consistent Hashing with Envoy
Feature 2 AuthN / AuthZ: Bound SA Token with TokenReview API Feature 3 Transparent Cache: PFIO Use cases in the Real World Case 1 AI/ML Dataset Loading Case 2 Large Model Deployment and Container Images Optimization Techniques Tech 1 CLOS Network optimization: Internal Trafﬁc Policy / Tech 1 Topology-Aware Routing Tech 2 Consistent Hashing Algorithm: RING_HASH / MAGLEV Supported by Cloud Native Technologies: Kubernetes and Envoy Internship members: @naoki9911, @ciffelia, @takonomura 64

Distributed Cache Empowers AI/ML Workloads on K...

Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster / KubeCon + CloudNativeCon North America 2024

More Decks by Preferred Networks

Other Decks in Technology

Featured

Transcript