
Reconciling Accuracy, Cost, and Latency of Inference Serving Systems

17th Cloud Control Workshop, Sweden, 2024
https://cloudresearch.org/workshops/17th/

University of Florida
Monday, Sept. 16
https://news.ece.ufl.edu/2024/08/14/seminar-pooyan-jamshidi/

Pooyan Jamshidi

June 26, 2024

Transcript

  1. Reconciling Accuracy, Cost, and Latency of Inference Serving Systems Pooyan

    Jamshidi https://pooyanjamshidi.github.io/ University of South Carolina
  2. Problem: Multi-Objective Optimization with Known Constraints under Uncertainty. Solutions, under different

    assumptions: InfAdapter [2023]: Autoscaling for ML Inference; IPA [2024]: Autoscaling for ML Inference Pipeline; Sponge [2024]: Autoscaling for ML Inference Pipeline with Dynamic SLO.
  3. InfAdapter [2023]: Autoscaling for ML Model Inference; IPA [2024]: Autoscaling

    for ML Inference Pipeline; Sponge [2024]: Autoscaling for ML Inference Pipeline with Dynamic SLO
  4. ML inference services have strict & conflicting requirements: Highly

    Accurate! Highly Responsive! Cost-Efficient!
  5. In ML pipelines, we can now adapt the quality of

    service, too! ResNet18: Tiger; ResNet152: Dog
  6. First insight: The same throughput can be achieved with different

    computing resources by switching the model variants
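
A small illustrative calculation of this insight (the per-core throughput numbers below are hypothetical, not from the talk): the same target request rate can be served either by a lightweight variant on a few cores or by a heavier, more accurate variant on many more.

```python
# Illustrative only: hypothetical per-core throughputs for three model variants.
# The point is that one target request rate maps to very different core counts
# depending on which variant serves it.
import math

throughput_per_core = {  # requests/sec one core can sustain (made-up values)
    "resnet18": 40.0,
    "resnet50": 15.0,
    "resnet152": 5.0,
}

target_rps = 120.0  # desired aggregate throughput

for variant, rps in throughput_per_core.items():
    print(f"{variant}: {math.ceil(target_rps / rps)} cores for {target_rps:.0f} req/s")
```
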
  7. (figure-only slide)

  8. InfAdapter: Implementation details. Selecting a subset of model variants,

    each sized so that it meets the latency requirement for the predicted workload, while maximizing accuracy and minimizing resource cost
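
A minimal brute-force sketch of this selection problem, not the paper's actual algorithm: every profile number (accuracy, latency, capacity, cores) is hypothetical, traffic is split evenly across the chosen variants, and the score simply trades average accuracy against total core count.

```python
# Simplified sketch of variant-subset selection: pick a subset of model variants
# that can serve the predicted workload within the latency SLO, trading off
# average accuracy against resource cost. All numbers are made up.
from itertools import combinations
from math import ceil

VARIANTS = {
    "resnet18":  {"acc": 0.70, "lat_ms": 20,  "cap": 40, "cores": 1},
    "resnet50":  {"acc": 0.76, "lat_ms": 60,  "cap": 15, "cores": 1},
    "resnet152": {"acc": 0.78, "lat_ms": 140, "cap": 5,  "cores": 1},
}

def select_variants(predicted_rps, slo_ms, cost_weight=0.01):
    """Return the best subset of variants (traffic split evenly) and replica counts."""
    best, best_score = None, float("-inf")
    names = list(VARIANTS)
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            # Skip subsets containing a variant that cannot meet the SLO at all.
            if any(VARIANTS[v]["lat_ms"] > slo_ms for v in subset):
                continue
            share = predicted_rps / len(subset)           # even traffic split
            replicas = {v: ceil(share / VARIANTS[v]["cap"]) for v in subset}
            cost = sum(replicas[v] * VARIANTS[v]["cores"] for v in subset)
            accuracy = sum(VARIANTS[v]["acc"] for v in subset) / len(subset)
            score = accuracy - cost_weight * cost         # maximize accuracy, minimize cost
            if score > best_score:
                best, best_score = (subset, replicas), score
    return best

print(select_variants(predicted_rps=100, slo_ms=100))
```
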
  9. InfAdapter: Experimental evaluation setup. Workload: Twitter-trace sample (2022-08). Baselines: Kubernetes

    VPA and Model-Switching. Models: ResNet18, ResNet34, ResNet50, ResNet101, ResNet152. Adaptation interval: 30 seconds. Kubernetes cluster: 48 cores, 192 GiB RAM
  10. Takeaway: Model variants provide the opportunity to reduce resource

    costs while adapting to the dynamic workload. Using a set of model variants simultaneously provides higher average accuracy compared to having one variant. Inference serving systems should consider accuracy, latency, and cost at the same time.
  11. InfAdapter!
  12. InfAdapter [2023]: Autoscaling for ML Model Inference; IPA [2024]: Autoscaling

    for ML Inference Pipeline; Sponge [2024]: Autoscaling for ML Inference Pipeline with Dynamic SLO
  13. Inference Pipeline: Video Decoder → Stream Muxer → Primary Detector → Object

    Tracker → Secondary Classifier, with 55, 86, 14, 44, and 86 configuration options per stage, respectively
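
Taking the per-stage option counts in the order listed above (an assumption about which count belongs to which stage), the joint configuration space is their product, which is why exhaustively tuning a composed pipeline quickly becomes infeasible.

```python
# Quick arithmetic on the pipeline above: the joint configuration space is the
# product of the per-stage option counts.
from math import prod

options_per_stage = {
    "video_decoder": 55,
    "stream_muxer": 86,
    "primary_detector": 14,
    "object_tracker": 44,
    "secondary_classifier": 86,
}

total = prod(options_per_stage.values())
print(f"Joint configurations: {total:,}")  # 250,576,480 combinations
```
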
  14. Platform: 1. Industry standard; 2. Used in recent

    research; 3. Complete set of autoscaling, scheduling, and observability tools (e.g., CPU usage); 4. APIs for changing the current autoscaling algorithms. ML server: 1. Industry-standard ML server; 2. Ability to build an inference graph; 3. REST and gRPC endpoints; 4. Many of the features we need (e.g., a monitoring stack) out of the box. How to navigate Model Variants
  15. Model Serving Pipeline: Is scaling alone enough? ✗ Snapshot

    of the system; ✗ Adaptivity to multiple objectives
  16. InfAdapter [2023]: Autoscaling for ML Model Inference; IPA [2024]: Autoscaling

    for ML Inference Pipeline; Sponge [2024]: Autoscaling for ML Inference Pipeline with Dynamic SLO
  17. Dynamic User -> Dynamic Network Bandwidths • Users move •

    Fluctuations in the network bandwidths • Reduced time budget for processing requests (SLO = network latency + processing latency)
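
A tiny sketch of the budget relation implied above, processing budget = SLO - network latency, with made-up millisecond values.

```python
# As network latency grows, the time left for inference shrinks.
def processing_budget_ms(slo_ms: float, network_latency_ms: float) -> float:
    """Time left for processing once network latency is subtracted from the SLO."""
    return max(0.0, slo_ms - network_latency_ms)

for net in (10, 40, 80):  # the user moves, bandwidth fluctuates
    budget = processing_budget_ms(slo_ms=100, network_latency_ms=net)
    print(f"{net} ms network latency -> {budget} ms left for processing")
```
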
  19. Inference Serving Requirements: Highly Responsive! (end-to-end latency guarantee) Cost-Efficient! (least

    resource consumption). Resource Scaling: In-place Vertical Scaling (more responsive) vs. Horizontal Scaling (more cost-efficient). Sponge!
  20. Vertical Scaling: DL Model Profiling • How much resource should

    be allocated to a DL model? • Latency vs. batch size → linear relationship • Latency vs. CPU allocation → inverse relationship
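
These two observations suggest a simple latency model, latency(batch, cores) ≈ (alpha * batch + beta) / cores. The sketch below uses placeholder coefficients (in practice they would be fit from offline profiles) to size the CPU allocation for a given batch and remaining time budget; it illustrates the relationship rather than Sponge's actual model.

```python
# Toy latency model: linear in batch size, inverse in CPU allocation.
from math import ceil

ALPHA, BETA = 8.0, 20.0  # hypothetical per-request and fixed costs (ms at 1 core)

def predicted_latency_ms(batch_size: int, cores: float) -> float:
    return (ALPHA * batch_size + BETA) / cores

def min_cores_for_budget(batch_size: int, budget_ms: float) -> int:
    """Smallest integer CPU allocation that keeps predicted latency within budget."""
    return ceil((ALPHA * batch_size + BETA) / budget_ms)

# e.g., a batch of 8 with 60 ms left after subtracting network latency:
cores = min_cores_for_budget(batch_size=8, budget_ms=60)
print(cores, "cores ->", predicted_latency_ms(8, cores), "ms predicted")
```
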
  21. System Design. Three design choices: 1. In-place vertical scaling

    • Fast response time; 2. Request reordering • High-priority requests; 3. Dynamic batching • Increase system utilization
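
A self-contained sketch, not Sponge's implementation, that combines request reordering (earliest deadline first) with dynamic batching: the batch size is capped by what the tightest deadline in the queue can tolerate under the toy latency model above. All names and numbers are illustrative.

```python
# Deadline-ordered queue plus dynamic batching, using the toy latency model
# latency(b, c) = (ALPHA * b + BETA) / c from the previous sketch.
import heapq

ALPHA, BETA = 8.0, 20.0

def max_batch_for_budget(budget_ms: float, cores: float) -> int:
    """Largest batch whose predicted latency still fits within the budget."""
    return max(1, int((budget_ms * cores - BETA) // ALPHA))

class Scheduler:
    def __init__(self, cores: float):
        self.cores = cores
        self.queue = []  # min-heap of (deadline_ms, request_id)

    def submit(self, request_id: str, deadline_ms: float):
        heapq.heappush(self.queue, (deadline_ms, request_id))

    def next_batch(self, now_ms: float):
        if not self.queue:
            return []
        tightest_deadline, _ = self.queue[0]  # earliest deadline first
        budget = tightest_deadline - now_ms
        size = max_batch_for_budget(budget, self.cores)
        return [heapq.heappop(self.queue)[1] for _ in range(min(size, len(self.queue)))]

sched = Scheduler(cores=2)
sched.submit("a", deadline_ms=150)
sched.submit("b", deadline_ms=90)   # tighter deadline, served first
sched.submit("c", deadline_ms=200)
print(sched.next_batch(now_ms=0))   # batch size bounded by request "b"'s budget
```
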
  22. Evaluation: SLO guarantees (99th percentile) with up to 20% resource

    savings compared to static resource allocation. Sponge source code: https://github.com/saeid93/sponge
  23. Future Directions: Resource Scaling, i.e., In-place Vertical Scaling

    (more responsive) and Horizontal Scaling (more cost-efficient). Sponge! How can both scaling mechanisms be used jointly under a dynamic workload to be responsive and cost-efficient while guaranteeing SLOs?
  24. Performance goals are competing, and users have preferences over these

    goals. The variability space (design space) of (composed) systems is increasing exponentially. Systems operate in uncertain environments with imperfect and incomplete knowledge. Goal: enabling users to find the right quality tradeoff. Testbeds: Lander Testbed (NASA), Turtlebot 3 (UofSC), Husky UGV (UofSC), CoBot (CMU)