
Reconciling Accuracy, Cost, and Latency of Inference Serving Systems

17th Cloud Control Workshop, Sweden, 2024
https://cloudresearch.org/workshops/17th/

University of Florida
Monday, Sept. 16
https://news.ece.ufl.edu/2024/08/14/seminar-pooyan-jamshidi/

Pooyan Jamshidi

June 26, 2024

Transcript

  1. Reconciling Accuracy, Cost, and Latency of Inference Serving Systems Pooyan

    Jamshidi https://pooyanjamshidi.github.io/ University of South Carolina
  2. Problem: Multi-Objective Optimization with Known Constraints under Uncertainty. Solutions, under different

    assumptions: InfAdapter [2023]: Autoscaling for ML Inference; IPA [2024]: Autoscaling for ML Inference Pipeline; Sponge [2024]: Autoscaling for ML Inference Pipeline with Dynamic SLO.
  3. InfAdapter [2023]: Autoscaling for ML Model Inference; IPA [2024]: Autoscaling

    for ML Inference Pipeline; Sponge [2024]: Autoscaling for ML Inference Pipeline with Dynamic SLO
  4. ML inference services have strict & conflicting requirements: Highly

    Accurate! Highly Responsive! Cost-Efficient!
  5. In ML pipelines, we can now adapt the quality of

    service, too! ResNet18: Tiger; ResNet152: Dog
  6. First insight: The same throughput can be achieved with different

    computing resources by switching the model variants
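
A small illustrative calculation of this insight (the per-core throughput numbers below are hypothetical, not from the talk): the same target request rate can be served either by a lightweight variant on a few cores or by a heavier, more accurate variant on many more.

```python
# Illustrative only: hypothetical per-core throughputs for three model variants.
# The point is that one target request rate maps to very different core counts
# depending on which variant serves it.
import math

throughput_per_core = {  # requests/sec one core can sustain (made-up values)
    "resnet18": 40.0,
    "resnet50": 15.0,
    "resnet152": 5.0,
}

target_rps = 120.0  # desired aggregate throughput

for variant, rps in throughput_per_core.items():
    print(f"{variant}: {math.ceil(target_rps / rps)} cores for {target_rps:.0f} req/s")
```
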
  7. (figure-only slide)

  8. InfAdapter: Implementation details. Selecting a subset of model variants,

    each sized so that it meets the latency requirement for the predicted workload, while maximizing accuracy and minimizing resource cost
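
A minimal brute-force sketch of this selection problem, not the paper's actual algorithm: every profile number (accuracy, latency, capacity, cores) is hypothetical, traffic is split evenly across the chosen variants, and the score simply trades average accuracy against total core count.

```python
# Simplified sketch of variant-subset selection: pick a subset of model variants
# that can serve the predicted workload within the latency SLO, trading off
# average accuracy against resource cost. All numbers are made up.
from itertools import combinations
from math import ceil

VARIANTS = {
    "resnet18":  {"acc": 0.70, "lat_ms": 20,  "cap": 40, "cores": 1},
    "resnet50":  {"acc": 0.76, "lat_ms": 60,  "cap": 15, "cores": 1},
    "resnet152": {"acc": 0.78, "lat_ms": 140, "cap": 5,  "cores": 1},
}

def select_variants(predicted_rps, slo_ms, cost_weight=0.01):
    """Return the best subset of variants (traffic split evenly) and replica counts."""
    best, best_score = None, float("-inf")
    names = list(VARIANTS)
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            # Skip subsets containing a variant that cannot meet the SLO at all.
            if any(VARIANTS[v]["lat_ms"] > slo_ms for v in subset):
                continue
            share = predicted_rps / len(subset)           # even traffic split
            replicas = {v: ceil(share / VARIANTS[v]["cap"]) for v in subset}
            cost = sum(replicas[v] * VARIANTS[v]["cores"] for v in subset)
            accuracy = sum(VARIANTS[v]["acc"] for v in subset) / len(subset)
            score = accuracy - cost_weight * cost         # maximize accuracy, minimize cost
            if score > best_score:
                best, best_score = (subset, replicas), score
    return best

print(select_variants(predicted_rps=100, slo_ms=100))
```
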
  9. InfAdapter: Experimental evaluation setup. Workload: Twitter-trace sample (2022-08). Baselines: Kubernetes

    VPA and Model-Switching. Models: ResNet18, ResNet34, ResNet50, ResNet101, ResNet152. Adaptation interval: 30 seconds. Kubernetes cluster: 48 cores, 192 GiB RAM
  10. Takeaway: Model variants provide the opportunity to reduce resource

    costs while adapting to the dynamic workload. Using a set of model variants simultaneously provides higher average accuracy compared to having one variant. Inference serving systems should consider accuracy, latency, and cost at the same time.
  11. InfAdapter!
  12. InfAdapter [2023]: Autoscaling for ML Model Inference; IPA [2024]: Autoscaling

    for ML Inference Pipeline; Sponge [2024]: Autoscaling for ML Inference Pipeline with Dynamic SLO
  13. Inference Pipeline: Video Decoder → Stream Muxer → Primary Detector → Object

    Tracker → Secondary Classifier, with 55, 86, 14, 44, and 86 configuration options per stage, respectively
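
Taking the per-stage option counts in the order listed above (an assumption about which count belongs to which stage), the joint configuration space is their product, which is why exhaustively tuning a composed pipeline quickly becomes infeasible.

```python
# Quick arithmetic on the pipeline above: the joint configuration space is the
# product of the per-stage option counts.
from math import prod

options_per_stage = {
    "video_decoder": 55,
    "stream_muxer": 86,
    "primary_detector": 14,
    "object_tracker": 44,
    "secondary_classifier": 86,
}

total = prod(options_per_stage.values())
print(f"Joint configurations: {total:,}")  # 250,576,480 combinations
```
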
  14. Platform: 1. Industry standard; 2. Used in recent

    research; 3. Complete set of autoscaling, scheduling, and observability tools (e.g., CPU usage); 4. APIs for changing the current autoscaling algorithms. ML server: 1. Industry-standard ML server; 2. Ability to build an inference graph; 3. REST and gRPC endpoints; 4. Many of the features we need (e.g., a monitoring stack) out of the box. How to navigate Model Variants
  15. Model Serving Pipeline: Is scaling alone enough? ✗ Snapshot

    of the system; ✗ Adaptivity to multiple objectives
  16. InfAdapter [2023]: Autoscaling for ML Model Inference; IPA [2024]: Autoscaling

    for ML Inference Pipeline; Sponge [2024]: Autoscaling for ML Inference Pipeline with Dynamic SLO
  17. Dynamic User -> Dynamic Network Bandwidths • Users move •

    Fluctuations in the network bandwidths • Reduced time budget for processing requests (SLO = network latency + processing latency)
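
A tiny sketch of the budget relation implied above, processing budget = SLO - network latency, with made-up millisecond values.

```python
# As network latency grows, the time left for inference shrinks.
def processing_budget_ms(slo_ms: float, network_latency_ms: float) -> float:
    """Time left for processing once network latency is subtracted from the SLO."""
    return max(0.0, slo_ms - network_latency_ms)

for net in (10, 40, 80):  # the user moves, bandwidth fluctuates
    budget = processing_budget_ms(slo_ms=100, network_latency_ms=net)
    print(f"{net} ms network latency -> {budget} ms left for processing")
```
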
  19. Inference Serving Requirements: Highly Responsive! (end-to-end latency guarantee) Cost-Efficient! (least

    resource consumption). Resource Scaling: In-place Vertical Scaling (more responsive) vs. Horizontal Scaling (more cost-efficient). Sponge!
  20. Vertical Scaling: DL Model Profiling • How much resource should

    be allocated to a DL model? • Latency vs. batch size → linear relationship • Latency vs. CPU allocation → inverse relationship
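
These two observations suggest a simple latency model, latency(batch, cores) ≈ (alpha * batch + beta) / cores. The sketch below uses placeholder coefficients (in practice they would be fit from offline profiles) to size the CPU allocation for a given batch and remaining time budget; it illustrates the relationship rather than Sponge's actual model.

```python
# Toy latency model: linear in batch size, inverse in CPU allocation.
from math import ceil

ALPHA, BETA = 8.0, 20.0  # hypothetical per-request and fixed costs (ms at 1 core)

def predicted_latency_ms(batch_size: int, cores: float) -> float:
    return (ALPHA * batch_size + BETA) / cores

def min_cores_for_budget(batch_size: int, budget_ms: float) -> int:
    """Smallest integer CPU allocation that keeps predicted latency within budget."""
    return ceil((ALPHA * batch_size + BETA) / budget_ms)

# e.g., a batch of 8 with 60 ms left after subtracting network latency:
cores = min_cores_for_budget(batch_size=8, budget_ms=60)
print(cores, "cores ->", predicted_latency_ms(8, cores), "ms predicted")
```
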
  21. System Design. Three design choices: 1. In-place vertical scaling

    • Fast response time; 2. Request reordering • High-priority requests; 3. Dynamic batching • Increase system utilization
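
A self-contained sketch, not Sponge's implementation, that combines request reordering (earliest deadline first) with dynamic batching: the batch size is capped by what the tightest deadline in the queue can tolerate under the toy latency model above. All names and numbers are illustrative.

```python
# Deadline-ordered queue plus dynamic batching, using the toy latency model
# latency(b, c) = (ALPHA * b + BETA) / c from the previous sketch.
import heapq

ALPHA, BETA = 8.0, 20.0

def max_batch_for_budget(budget_ms: float, cores: float) -> int:
    """Largest batch whose predicted latency still fits within the budget."""
    return max(1, int((budget_ms * cores - BETA) // ALPHA))

class Scheduler:
    def __init__(self, cores: float):
        self.cores = cores
        self.queue = []  # min-heap of (deadline_ms, request_id)

    def submit(self, request_id: str, deadline_ms: float):
        heapq.heappush(self.queue, (deadline_ms, request_id))

    def next_batch(self, now_ms: float):
        if not self.queue:
            return []
        tightest_deadline, _ = self.queue[0]  # earliest deadline first
        budget = tightest_deadline - now_ms
        size = max_batch_for_budget(budget, self.cores)
        return [heapq.heappop(self.queue)[1] for _ in range(min(size, len(self.queue)))]

sched = Scheduler(cores=2)
sched.submit("a", deadline_ms=150)
sched.submit("b", deadline_ms=90)   # tighter deadline, served first
sched.submit("c", deadline_ms=200)
print(sched.next_batch(now_ms=0))   # batch size bounded by request "b"'s budget
```
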
  22. Evaluation: SLO guarantees (99th percentile) with up to 20% resource

    savings compared to static resource allocation. Sponge source code: https://github.com/saeid93/sponge
  23. Future Directions: Resource Scaling, i.e., In-place Vertical Scaling

    (more responsive) and Horizontal Scaling (more cost-efficient). Sponge! How can both scaling mechanisms be used jointly under a dynamic workload to be responsive and cost-efficient while guaranteeing SLOs?
  24. Performance goals are competing, and users have preferences over these

    goals. The variability space (design space) of (composed) systems is increasing exponentially. Systems operate in uncertain environments with imperfect and incomplete knowledge. Goal: enabling users to find the right quality tradeoff. Testbeds: Lander Testbed (NASA), Turtlebot 3 (UofSC), Husky UGV (UofSC), CoBot (CMU)