
CafeGPT: Serving LLMs Like Coffee With Kubernetes

Madhav Jivrajani

December 18, 2025

Transcript

  1. CafeGPT: Serving LLMs Like Coffee With Kubernetes. Madhav Jivrajani & Kartik Ramesh. Gopher credits: https://github.com/ashleymcnamara/gophers
  2. This isn’t a deep dive into LLM inference or the internals of how Kubernetes enables it. There are talks at this conference and others that are much better suited to that topic!
  3. We’ve been trying to educate ourselves on how the many different pieces of this ecosystem fit together, and this is us hoping to share that picture with you.
  4. Welcome to CafeGPT! • CafeGPT is the best cafe at KubeCon 2025. • We have a new customer with a drink request.
  5. Welcome to CafeGPT! • CafeGPT is the best cafe at KubeCon 2025. • We have a new customer with a drink request. • There’s a manager and a barista to help fulfill the request!
  6. Welcome to CafeGPT! • The manager directs the request to the barista. • The barista has access to coffee machines to help fulfill the request.
  7. Welcome to CafeGPT! • The manager directs the request to the barista. • The barista has access to coffee machines to help fulfill the request. • Each coffee machine has many moving parts: the grinder, the steamer and the pressure valve.
  8. Welcome to CafeGPT! • The barista uses a coffee machine to make the drink… • … and finally serves the drink back to the customer via the manager.
  9. Welcome to CafeGPT! • Our customer loves the coffee! • They loved it because it was served in a timely manner and it wasn’t super expensive.
  10. Serving In CafeGPT • The drink request is the… request. • The manager is the router responsible for delegating the request to a barista.
  11. Serving In CafeGPT • The drink request is the… request. • The manager is the router responsible for delegating the request to a barista. • The barista is the inference engine - using resources available to it and converting requests to coffee.
  12. Serving In CafeGPT • The drink request is the… request. • The manager is the router responsible for delegating the request to a barista. • The barista is the inference engine - using resources available to it and converting requests to coffee. • The coffee machines are our GPUs that actually brew the responses!
  13. Let’s continue making things a little more concrete… what does the lifecycle of an LLM request look like?
  14. Lifecycle of an LLM request: LLM inference engines convert the user input into an output using a trained LLM on specialized hardware.
  15. KV caching speeds up LLM inference by 5x [1], but it comes at the cost of higher memory use, which becomes a bottleneck for processing requests concurrently.
  16. KV caching speeds up LLM inference by 5x [1], but it comes at the cost of higher memory use, which becomes a bottleneck for processing requests concurrently. For Llama 3.1 8B with an 8K context length on an NVIDIA A100 GPU, you can only serve about 24 requests concurrently [2]. (Memory available: 40 GB; model weights in FP16: 16 GB; KV cache per request: 1 GB.)
  17. KV caching speeds up LLM inference by 5x [1], but it comes at the cost of higher memory use, which becomes a bottleneck for processing requests concurrently. For Llama 3.1 8B with an 8K context length on an NVIDIA A100 GPU, you can only serve about 24 requests concurrently [2]. (Memory available: 40 GB; model weights in FP16: 16 GB; KV cache per request: 1 GB.) As a result, a lot of research has emerged to address this bottleneck [3].
     [1] https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms
     [2] https://lmcache.ai/kv_cache_calculator.html
     [3] https://arxiv.org/abs/2309.06180, https://lmcache.ai/tech_report.pdf
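     (Back-of-the-envelope for the 24 above: of the A100’s 40 GB, 16 GB goes to the FP16 weights, leaving 40 - 16 = 24 GB for KV cache; at roughly 1 GB of KV cache per 8K-context request, that is about 24 requests resident in memory at once.)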
  18. Prefill vs Decode: Serving at CafeGPT has two phases: 1. Prefill - do the prep work. 2. Decode - multiple iterations of small units of work. Important workload characteristics: 1. Prefill is computationally intensive, but lasts a short duration. 2. Decode is memory intensive, and lasts a long duration. 3. You don’t know when the Decode will end.
  19. Workload SLAs: From a customer point of view, we want: 1. Low Time to Coffee <-> Request latency. 2. Low Time between drops <-> Inter-Token Latency. 3. Low Time to first drop <-> Time to first token. From a provider point of view, we want: 1. More Coffee per second <-> Throughput. 2. High machine utilization. Different customers might have different SLAs.
  20. Batching: Throughput increases by over 10x due to batching [1]. Request latency improves due to lower queue latencies [1].
     [1] https://www.anyscale.com/blog/continuous-batching-llm-inference
  21. Batching: Interference between Prefill and Decode. Batch size controls iteration latency. Conflict between time to first drop and time between drops!
  22. Same Infra, New Workload? • Kubernetes has become the de facto choice for a large percentage of companies to build their platforms on.
  23. Same Infra, New Workload? • Kubernetes has become the de facto choice for a large percentage of companies to build their platforms on. • Inference is an interesting new workload with unique characteristics that can be served very well* without having to reinvent the wheel.
  24. Same Infra, New Workload? • Kubernetes has become the de facto choice for a large percentage of companies to build their platforms on. • Inference is an interesting new workload with unique characteristics that can be served very well* without having to reinvent the wheel. • In fact, the community has been relentlessly evolving the core and ecosystem projects to better support this workload.
  25. Espresso (Smol) Models: Maybe your model is small enough to need just one GPU.
        ...
        resources:
          limits:
            nvidia.com/gpu: 1
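     As a slightly fuller (hypothetical) sketch, the whole Pod for such a model might look like this - the name, image and port are placeholders for whatever inference engine you run, and the only load-bearing part is the nvidia.com/gpu limit exposed by the NVIDIA device plugin:
        apiVersion: v1
        kind: Pod
        metadata:
          name: espresso-model                            # hypothetical name
        spec:
          containers:
          - name: inference-engine
            image: example.com/inference-engine:latest    # placeholder image
            ports:
            - containerPort: 8000                         # assumed serving port
            resources:
              limits:
                nvidia.com/gpu: 1                         # one whole GPU, as on the slide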
  26. Espresso (Smol) Models: Maybe… you just need a fraction of a GPU.
        ...
        resources:
          limits:
            nvidia.com/mig-1g.5gb.shared: 1
  27. Maximising Hardware Efficiency: However, this may not maximise hardware efficiency. For example, if I need 1/8th of an H100, I might be forced to run on a slice larger than what I need - leading to wastage.
  28. Maximising Hardware Efficiency: I also may not need a particular type of GPU - for example, all I may care about is 20Gi of GPU memory; I don’t care where it comes from.
        ...
        resources:
          limits:
            nvidia.com/gpu: 1
  29. Maximising Hardware Efficiency: DRA (Dynamic Resource Allocation) can help with that! For example, vendors may be able to partition devices on the fly (KEP-4815):
        ...
        requests:
          ...
          "device.attributes['gpu.nvidia.com'].profile == '1g.5gb'"
  30. Maximising Hardware Efficiency: DRA (Dynamic Resource Allocation) can help with that!
        ...
        requests:
          ...
          "device.capacity['nvidia.com'].memory.compareTo(quantity('10Gi')) >= 0"
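     To make the CEL selector above a bit more concrete, a full ResourceClaim might look roughly like this. This is a sketch assuming the resource.k8s.io/v1beta1 API shape and a vendor-provided gpu.nvidia.com DeviceClass; DRA field names differ across API versions, so treat it as illustrative:
        apiVersion: resource.k8s.io/v1beta1
        kind: ResourceClaim
        metadata:
          name: gpu-memory-claim                # hypothetical name
        spec:
          devices:
            requests:
            - name: gpu
              deviceClassName: gpu.nvidia.com   # assumed vendor DeviceClass
              selectors:
              - cel:
                  expression: "device.capacity['nvidia.com'].memory.compareTo(quantity('10Gi')) >= 0"
     The Pod would then reference this claim through spec.resourceClaims and each container’s resources.claims, instead of an nvidia.com/gpu limit.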
  31. Seems Familiar? We want our cafe to be able to handle many different kinds of scenarios: 1. Surges in the morning and at lunch, slower at night. 2. Surges of specific drinks on occasion. 3. Different workloads. For LLM serving, these changes manifest as conversation-heavy vs coding-heavy tasks, or increased requests for certain models.
  32. Horizontal Pod Autoscaler: Let’s use the Kubernetes HPA to scale our serving system! CPU / memory utilization? ❌
  33. Horizontal Pod Autoscaler: Concurrency? ❌ Summarization: 1000 input / 100 output tokens. Code generation: 100 input / 1000 output tokens. Traditional autoscaling metrics are not a great fit for LLM workloads.
  34. Queue Lengths: An increasing queue length indicates that our barista can’t keep up with their orders. New requests will be blocked until pending requests are completed (an unknown amount of time).
  35. Queue Lengths: An increasing queue length indicates that our barista can’t keep up with their orders. New requests will be blocked until all pending requests are completed (an unknown amount of time). Autoscaling on queue lengths boosts your processing throughput and reduces queuing delays; however, it does not lower the latency of processing individual requests.
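     As a sketch of what autoscaling on queue length could look like: an autoscaling/v2 HPA keyed off a per-Pod queue-length metric. The metric name num_requests_waiting, the Deployment name, and the existence of a custom-metrics adapter (e.g. prometheus-adapter) that surfaces the metric are assumptions here, not something the deck prescribes:
        apiVersion: autoscaling/v2
        kind: HorizontalPodAutoscaler
        metadata:
          name: barista-hpa                     # hypothetical name
        spec:
          scaleTargetRef:
            apiVersion: apps/v1
            kind: Deployment
            name: inference-engine              # hypothetical Deployment of engine replicas
          minReplicas: 2
          maxReplicas: 10
          metrics:
          - type: Pods
            pods:
              metric:
                name: num_requests_waiting      # assumed per-Pod queue-length metric
              target:
                type: AverageValue
                averageValue: "5"               # scale out once the average queue exceeds ~5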
  36. Batch Sizes: If latency is a concern, keep batch sizes low and monitor the number of active batched tokens. Scaling on batch size can help you scale earlier than waiting for queue buildup and still meet SLAs, but you might over-react to temporary spikes.
  37. Prefill Decode Disaggregation: To improve latency, scale up your disaggregated prefill deployments. Two options for scaling: 1. Scale up P:D instances keeping the ratio constant. a. Good for an increased volume of requests with a constant workload mix. 2. Scale up individual P or D instances. a. Good if you don’t have a lot of GPUs, or if your workload has shifted.
  38. KV cache utilization: Often, KV cache space can be the main bottleneck. 1. Reactive: Monitor KV cache utilization and scale if it exceeds a threshold. 2. Proactive: Use your workload to estimate how much KV cache you need to serve requests while meeting SLAs. (Memory available: 40 GB; model weights in FP16: 16 GB; KV cache per request: 1 GB.)
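     The reactive option can reuse the HPA sketch from the queue-length note above, just keyed off a cache-utilization gauge instead of queue length; the metric name kv_cache_utilization is again an assumption about what your engine and metrics adapter expose:
        metrics:
        - type: Pods
          pods:
            metric:
              name: kv_cache_utilization        # assumed gauge in the range [0, 1]
            target:
              type: AverageValue
              averageValue: "800m"              # i.e. scale out above roughly 80% utilization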
  39. Seems Familiar? With autoscaling in place: • How do we load balance across these replicas?
  40. Seems Familiar? With autoscaling in place: • How do we load balance across these replicas? • Can we do better than round-robin?
  41. Seems Familiar? With autoscaling in place: • How do we load balance across these replicas? • Can we do better than round-robin? ◦ Yes! Sending requests round-robin might actually result in degraded performance. ◦ You may end up with hotspots because each workload is not the same.
  42. Good Routing and Better SLAs • Your inference engine is caching processed tokens (KV Cache).
  43. Good Routing and Better SLAs • Your inference engine is caching processed tokens (KV Cache). • The router can use this information to better balance requests. https://github.com/kubernetes-sigs/gateway-api-inference-extension
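     The Gateway API Inference Extension linked above models this with an InferencePool that fronts the engine Pods and hands endpoint selection to an endpoint-picker extension. A rough sketch, loosely following the v1alpha2 shape of the CRD - the names here are hypothetical and the field names have been evolving across releases, so treat this as illustrative rather than a reference:
        apiVersion: inference.networking.x-k8s.io/v1alpha2
        kind: InferencePool
        metadata:
          name: barista-pool                    # hypothetical name
        spec:
          selector:
            app: inference-engine               # hypothetical label on the engine Pods
          targetPortNumber: 8000                # assumed serving port
          extensionRef:
            name: barista-epp                   # endpoint picker that applies the KV-cache / queue-aware routing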
  44. Good Routing and Better SLAs • You don’t want to route to the same instance all the time because you might overload it. • You might also want to route to another instance as an escape hatch. But which one?
  45. Good Routing and Better SLAs • You don’t want to route to the same instance all the time because it might then become the bottleneck. • You might also want to route to another instance as an escape hatch. • We can route to the least loaded replica! ◦ KV cache utilization ◦ Queue lengths ◦ Some custom metric that matters to you
  46. Good Routing and Better SLAs: llm-d and AIBrix are two such ecosystem projects that help with better routing and load balancing. https://llm-d.ai/, https://aibrix.readthedocs.io/latest/
  47. Phew! That was a lot of info. Let’s zoom out and look at the picture we’ve been constructing.
  48. We hope we’ve convinced you that you don’t need to be proficient in language modelling, and that when you work with these systems you can ground yourself in the fact that serving LLMs is as tractable as serving coffee!