…not scalable
On-premise Storages for the dataset loading
• Object Storage
  ◦ Scalable but not fast
    ▪ We use HDDs as the backend of our object storage
• Node Local Storage (NVMe)
  ◦ Very fast, but the storage is not globally available and not scalable
    ▪ If the workload is moved to a different compute node, the data becomes unreachable
[Figure: a Deep Neural Network workload preempted on Compute Node A and re-scheduled to Compute Node B can no longer reach the data left on Node A's local storage]
[Figure: compute nodes running Deep Neural Network training use a Cache built as a "Cloud of Node Local Storage" (performance-optimized) in front of the capacity-optimized object storage]
…just returns local files

Overview of Simple Cache Service
• Cloud Native ✓ SCS runs on Kubernetes
• Shared-nothing architecture ✓ Scalable
• Position as a Cache ✓ It is just a "Cache", not "Persistent Storage"
Load Balancing in Layer 7
• Service with Topology Aware Hints
• Envoy Proxy with Consistent Hashing
[Figure: within Network Zone B, a User's GET /objects/A requests are routed to Cache A and a GET /objects/B request to Cache B]
…with the token in the Authorization header

Authorization (1/2)
[Figure: User Pods send requests to SCS with a ServiceAccount token in the Authorization header]
3. Verify the token with the TokenReview API
  ✓ "Aud as expected?", "Valid until?", "Pod is still alive?"
  ✓ Resolve the namespace of the source from the ServiceAccount username
    ➡ Namespace-level authorization can be implemented
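As a rough illustration of step 3, the sketch below calls the TokenReview API through the official kubernetes Python client; the audience value, Service name, and helper name are assumptions for this sketch, not taken from the SCS implementation.

verify_token_sketch.py (illustrative, not from the slides)
# Verify a ServiceAccount token via the Kubernetes TokenReview API.
# EXPECTED_AUDIENCE and the function name are assumptions for this sketch.
from kubernetes import client, config

EXPECTED_AUDIENCE = "scs"  # assumed audience value

def verify_request_token(token: str) -> str:
    """Verify the token and return the caller's namespace."""
    config.load_incluster_config()
    review = client.AuthenticationV1Api().create_token_review(
        client.V1TokenReview(
            spec=client.V1TokenReviewSpec(token=token, audiences=[EXPECTED_AUDIENCE])
        )
    )
    status = review.status
    if not status.authenticated:
        raise PermissionError("token rejected by TokenReview")
    # For bound ServiceAccount tokens the API server also checks that the
    # referenced Pod still exists ("Pod is still alive?").
    # Username format: system:serviceaccount:<namespace>:<serviceaccount-name>
    return status.user.username.split(":")[2]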
• It can read / write / list files on a local filesystem, S3-compatible object storage, HDFS, …

Read File in Object Storage with PFIO
[Figure: the training script sends GET /000.jpg to Object Storage and receives 200 OK]

train_with_scs.py
import pfio
import torch

fs = pfio.v2.from_url(zip_url)  # fs behaves like a local filesystem object, but is actually S3
file_url = "000.jpg"
with fs.open(file_url) as fp:
    image = fp.read()
    # image = torch.Tensor(image)...
    # loss = model(image)
…in SCS first, then tries the origin if the data does not exist
• At first, the desired data is not stored in SCS, so PFIO reads it from the origin and puts it into SCS

Transparent Object Storage Cache by PFIO
[Figure: GET /000.jpg to SCS returns 404 Not Found, the file is then read from Object Storage (200 OK), and PUT /000.jpg stores it into SCS (201 Created)]

train_with_scs.py
import pfio
import torch

fs = pfio.v2.from_url(zip_url, http_cache=scs_url)  # fs behaves like a local filesystem object, but is actually S3
file_url = "000.jpg"
with fs.open(file_url) as fp:
    image = fp.read()
    # image = torch.Tensor(image)...
    # loss = model(image)
…in SCS first, then tries the origin if the data does not exist
• If the desired data is already stored in SCS, accessing Object Storage can be skipped

Transparent Object Storage Cache by PFIO
[Figure: GET /000.jpg to SCS returns 200 OK directly, so the access to Object Storage is skipped]
(train_with_scs.py is the same script as on the previous slide)
Features to implement
◦ URL mappings
  ▪ from origin key to SCS bucket/key
◦ AuthN/AuthZ if needed
◦ Other necessary features

Features not to implement
✓ Storage management
  ◦ Cache eviction
  ◦ Capacity control

[Figure: read-through flow for an origin service (e.g. a container image layer): GET /000.jpg to SCS returns 404 Not Found, the object is fetched from the Origin Service (200 OK) and PUT into SCS (201 Created), and subsequent GETs return 200 OK from SCS]
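As an illustration of this read-through flow (the small adapter an origin-service client would implement around SCS), here is a rough sketch assuming the Python requests library; the endpoints and the mapping function are hypothetical, not part of SCS:

read_through_sketch.py (illustrative, not from the slides)
# Hypothetical read-through client for an SCS-backed cache of an origin service.
# SCS_BASE, ORIGIN_BASE and map_origin_key_to_scs() are assumptions for this sketch.
import requests

SCS_BASE = "http://scs.example.local"        # assumed SCS endpoint
ORIGIN_BASE = "http://origin.example.local"  # assumed origin service endpoint

def map_origin_key_to_scs(key: str) -> str:
    """URL mapping from an origin key to an SCS bucket/key (illustrative)."""
    return f"{SCS_BASE}/my-bucket/{key}"

def get_with_cache(key: str) -> bytes:
    # 1. Try SCS first.
    cached = requests.get(map_origin_key_to_scs(key))
    if cached.status_code == 200:
        return cached.content
    # 2. On a miss (404), fall back to the origin service.
    origin = requests.get(f"{ORIGIN_BASE}/{key}")
    origin.raise_for_status()
    body = origin.content
    # 3. Put the object into SCS so the next reader gets a hit.
    #    Cache eviction / capacity control are handled by SCS itself.
    requests.put(map_origin_key_to_scs(key), data=body)
    return body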
Where to deploy Envoy?
• …use local NVMe drives
  ◦ Also, User Pods will be scheduled to all nodes to use all accelerators
• Where to deploy Envoy?
  ◦ We deploy Envoy to all nodes to reduce inter-zone Pod/Envoy traffic.
  ◦ Inter-zone Envoy/SCS traffic is unavoidable in that case.
[Figure: network topology of Spine Switches and Leaf Switches, with a group of nodes under each leaf]
Internal Traffic Policy
• Pod/Envoy Traffic
  ◦ Perfect / No network traffic
• Envoy load balance
  ◦ Bad / No distribution of traffic
  ◦ When some nodes use SCS heavily, those nodes' Envoy CPU load becomes high

Topology Aware Routing
• Pod/Envoy Traffic
  ◦ Moderate / In-zone network traffic only
• Envoy load balance
  ◦ Moderate / Traffic is distributed within a zone
  ◦ When some nodes use SCS heavily, the Envoy CPU load is distributed across the zone

We use Topology Aware Routing to improve Envoy's CPU load balance (a configuration sketch follows below)
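A minimal sketch of how these two options might be set on the SCS Envoy Service, using the official kubernetes Python client; the Service name, namespace, and the exact topology annotation key are assumptions (the annotation key has changed across Kubernetes versions):

traffic_policy_sketch.py (illustrative, not from the slides)
# Compare the two Service configurations discussed above.
# "scs-envoy" / "scs" and the annotation key are assumptions for this sketch;
# in practice you would pick only one of the two options.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Option A: internalTrafficPolicy=Local keeps Pod -> Envoy traffic on the same node.
core.patch_namespaced_service(
    "scs-envoy", "scs",
    {"spec": {"internalTrafficPolicy": "Local"}},
)

# Option B: Topology Aware Routing keeps traffic within the zone instead
# (newer clusters use the topology-mode annotation shown here).
core.patch_namespaced_service(
    "scs-envoy", "scs",
    {"metadata": {"annotations": {"service.kubernetes.io/topology-mode": "Auto"}}},
)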
Load Balancing of Keys (Bucket and Object)
• …SCS consistently
  ◦ When we put an object to the N-th SCS, we want to get it from the N-th SCS.
• The easiest way to achieve that:
  ◦ Manage a mapping from bucket/object to the id of a backend
  ◦ The mapping should be sharded…
    ▪ Introduce a distributed MDS (metadata server)
    ▪ Too complicated a solution for us
• We don't manage a mapping explicitly
  ◦ Use hash(bucket + "/" + object) to choose a backend server (see the sketch below)
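The sketch below illustrates this hash-based selection with a simple consistent-hash ring; in SCS the actual selection is done by Envoy's RING_HASH policy, so the backend names and point counts here are made up:

choose_backend_sketch.py (illustrative, not from the slides)
# Pick a backend for hash(bucket + "/" + object) on a consistent-hash ring.
import bisect
import hashlib

def h(s: str) -> int:
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

BACKENDS = ["cache-0", "cache-1", "cache-2", "cache-3"]
POINTS_PER_BACKEND = 128  # more points -> smoother key distribution

# Build the ring: each backend owns many pseudo-random points on the hash space.
ring = sorted((h(f"{b}#{i}"), b) for b in BACKENDS for i in range(POINTS_PER_BACKEND))
hashes = [p for p, _ in ring]

def choose_backend(bucket: str, obj: str) -> str:
    """Pick the first ring point clockwise from hash(bucket + "/" + object)."""
    key_hash = h(bucket + "/" + obj)
    idx = bisect.bisect(hashes, key_hash) % len(ring)
    return ring[idx][1]

print(choose_backend("dataset", "000.jpg"))  # always the same backend for this key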
Key distribution matters!
• The length of an arc in the RING corresponds to the ratio of responsibilities (a rough simulation follows below)
  ◦ Backend 3 has 1.5x the responsibility of Backend 4
• It affects the performance!
  ◦ B3's CPU usage is 1.5x of B4's
    ▪ Because B3 is 1.5x busier than B4
    ▪ May result in longer latency
  ◦ The lifetime of B3's data is 1.5x shorter than B4's data
    ▪ Because the cache capacity is the same
    ▪ More chance of deletion
• We want to see consistent resource usage and lifetime
[Figure: a RING_HASH ring with Backends 1-4 owning arcs of unequal length]
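To make the arc-length argument concrete, here is a rough, self-contained simulation (not from the slides) in which backend-3's arc is 1.5x longer than backend-4's; the boundaries and key names are made up:

ring_imbalance_sketch.py (illustrative, not from the slides)
# If a backend owns a longer arc of the hash ring, it receives proportionally
# more keys, and therefore more CPU load and faster cache turnover.
import hashlib
from collections import Counter

# Cumulative arc boundaries on [0, 1): B1 and B2 own 0.25 each,
# B3 owns 0.30 and B4 owns 0.20 (a 1.5x difference).
ARCS = [(0.25, "backend-1"), (0.50, "backend-2"), (0.80, "backend-3"), (1.00, "backend-4")]

def position(key: str) -> float:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def owner(key: str) -> str:
    p = position(key)
    for boundary, backend in ARCS:
        if p < boundary:
            return backend
    return ARCS[-1][1]

counts = Counter(owner(f"bucket/{i:06}.jpg") for i in range(100_000))
print(counts)
# backend-3 ends up with roughly 1.5x as many keys as backend-4, so its CPU
# usage is higher and its cached objects are evicted sooner.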