Autoscaling for ML Inference
IPA [2024]: Autoscaling for ML Inference Pipelines
Sponge [2024]: Autoscaling for ML Inference Pipelines with Dynamic SLOs
Problem: Different Assumptions
Reduce costs while adapting to the dynamic workload. Using a set of model variants simultaneously provides higher average accuracy than any single variant. Inference serving systems should consider accuracy, latency, and cost at the same time. InfAdapter!
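To make the accuracy/latency/cost tradeoff concrete, here is a minimal sketch of the idea behind InfAdapter (illustrative, not the paper's actual algorithm): brute-force over variant sets, keep those whose variants all meet the latency SLO and whose total cost fits a budget, and score each set by the traffic-weighted accuracy of routing requests to the most accurate variants first. All variant names and numbers are assumptions for the example.

```python
# Hedged sketch of variant-set selection under a latency SLO and cost budget.
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Variant:
    name: str
    accuracy: float      # profiled accuracy
    p99_ms: float        # profiled p99 latency (ms)
    capacity_rps: float  # sustainable throughput of one deployment
    cost: float          # cost of one deployment ($/h)

def weighted_accuracy(subset, rate_rps):
    """Route traffic greedily to the most accurate variants first; None if
    the set lacks the capacity to serve the whole arrival rate."""
    remaining, acc_sum = rate_rps, 0.0
    for v in sorted(subset, key=lambda v: -v.accuracy):
        served = min(remaining, v.capacity_rps)
        acc_sum += served * v.accuracy
        remaining -= served
    return None if remaining > 1e-9 else acc_sum / rate_rps

def select(variants, rate_rps, slo_ms, budget, max_k=3):
    best, best_acc = None, -1.0
    for k in range(1, max_k + 1):
        for sub in combinations(variants, k):
            if any(v.p99_ms > slo_ms for v in sub):
                continue                      # every variant must meet the SLO
            if sum(v.cost for v in sub) > budget:
                continue                      # stay within the cost budget
            acc = weighted_accuracy(sub, rate_rps)
            if acc is not None and acc > best_acc:
                best, best_acc = sub, acc
    return best, best_acc

zoo = [Variant("resnet18", 0.70, 20, 100, 1.0),
       Variant("resnet50", 0.76, 45, 50, 2.0),
       Variant("resnet152", 0.78, 110, 30, 2.5)]

# At 100 rps, SLO=120 ms, budget=4: {resnet18, resnet50} wins (~0.73),
# beating the only affordable single variant (resnet18 alone at 0.70).
print(select(zoo, rate_rps=100, slo_ms=120, budget=4.0))
```

This is why a set can beat any single variant: cheap variants absorb the bulk of the traffic while part of the load is routed to a more accurate variant that could never serve the full rate within budget on its own.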
Cluster requirements:
1. A set of autoscaling, scheduling, and observability tools (e.g., CPU usage)
2. APIs for changing the current autoscaling algorithms

ML server requirements:
1. Industry-standard ML server
2. Ability to build an inference graph
3. REST and gRPC endpoints
4. Many of the features we need (e.g., a monitoring stack) out of the box

How to Navigate Model Variants
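Read in code, "navigating model variants" could be a control loop that re-solves the selection above as the workload moves. A hedged sketch, reusing select() and zoo from the previous example; get_request_rate() and deploy_variant_set() are hypothetical stand-ins for the monitoring stack and orchestration APIs from the lists above.

```python
# Hypothetical variant-navigation loop: re-solve the variant-set selection
# as the measured workload changes, then reconfigure the serving deployments.
import random
import time

def get_request_rate():
    """Hypothetical monitoring hook; a real system would query the
    out-of-the-box monitoring stack (e.g., Prometheus)."""
    return random.uniform(40, 160)

def deploy_variant_set(chosen):
    """Hypothetical orchestration hook; a real system would reconfigure
    deployments through the cluster's APIs."""
    print("now serving:", [v.name for v in chosen])

def navigate(slo_ms=120, budget=4.0, interval_s=30, rounds=3):
    current = None
    for _ in range(rounds):            # bounded here; a real loop runs forever
        rate = get_request_rate()
        chosen, _ = select(zoo, rate, slo_ms, budget)
        if chosen is not None and chosen != current:
            deploy_variant_set(chosen)  # only reconfigure when the set changes
            current = chosen
        time.sleep(interval_s)

navigate()
```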
Vertical scaling (more responsive) vs. horizontal scaling (more cost-efficient). Sponge! How can both scaling mechanisms be used jointly under a dynamic workload to be responsive and cost-efficient while guaranteeing SLOs?
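One hedged way to picture the joint use (an illustrative policy, not Sponge's actual algorithm): absorb load changes with in-place vertical resizing while the demand fits under a per-pod CPU ceiling, and fall back to horizontal scaling when it does not. The ceiling and service rate below are assumed numbers.

```python
# Illustrative joint-scaling policy: vertical in-place resizing is fast
# (responsive); horizontal replica changes are slower to provision but
# raise total capacity (cost-efficient at scale).
import math
from dataclasses import dataclass

@dataclass
class Allocation:
    replicas: int
    cpu_per_replica: float           # cores per pod, resizable in place

MAX_CPU = 4.0                        # assumed per-pod CPU ceiling
RPS_PER_CORE = 25.0                  # assumed profiled service rate

def plan(current: Allocation, demand_rps: float) -> Allocation:
    need_cores = demand_rps / RPS_PER_CORE
    per_replica = need_cores / current.replicas
    if per_replica <= MAX_CPU:
        # Vertical: resize existing pods in place -- no new pods, no cold start.
        return Allocation(current.replicas, per_replica)
    # Horizontal: add replicas, each sized at the ceiling.
    replicas = math.ceil(need_cores / MAX_CPU)
    return Allocation(replicas, need_cores / replicas)

current = Allocation(replicas=2, cpu_per_replica=1.0)
print(plan(current, demand_rps=120))  # 4.8 cores -> vertical: 2 pods @ 2.4 cores
print(plan(current, demand_rps=400))  # 16 cores -> horizontal: 4 pods @ 4.0 cores
```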
Goals
The variability space (design space) of (composed) systems is exponentially increasing.
Systems operate in uncertain environments with imperfect and incomplete knowledge.
Goal: enabling users to find the right quality tradeoff.
Testbeds: Lander Testbed (NASA), TurtleBot 3 (UofSC), Husky UGV (UofSC), CoBot (CMU)
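The "right quality tradeoff" goal above is naturally read as Pareto reasoning over the design space. A minimal sketch under assumed metrics (accuracy to maximize, latency and cost to minimize); the configurations are illustrative.

```python
# Filter a configuration space down to its Pareto-optimal set: keep configs
# not dominated by another that is at least as accurate AND no slower AND
# no more expensive.
def pareto_front(configs):
    def dominates(a, b):
        return (a["accuracy"] >= b["accuracy"]
                and a["latency"] <= b["latency"]
                and a["cost"] <= b["cost"]
                and a != b)
    return [c for c in configs
            if not any(dominates(other, c) for other in configs)]

configs = [
    {"name": "small",  "accuracy": 0.70, "latency": 20, "cost": 1.0},
    {"name": "medium", "accuracy": 0.76, "latency": 45, "cost": 2.0},
    {"name": "waste",  "accuracy": 0.70, "latency": 50, "cost": 3.0},
]
print(pareto_front(configs))  # "waste" is dominated by "small" and drops out
```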