Because of their large GPU memory footprint and compute requirements, serving dominates the compute cost of most real-world LLM applications. ML engineers often treat LLMs as "black boxes" that can only be optimized with internal changes such as quantization and custom CUDA kernels. However, this is not entirely the case. Because LLMs generate their output iteratively, and because LLM inference is often memory-bound rather than compute-bound, there are surprising system-level batching optimizations that make a 10x or greater difference on real-world workloads.
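To see why iterative generation leaves room on the table, here is a minimal sketch (not a real model, just toy arithmetic with a hypothetical `static_batch_decode` helper) of how static batching wastes GPU slots: every request in a batch occupies its slot until the longest request finishes.

```python
# Toy illustration: why static batching under-utilizes the GPU.
# Each request needs a different number of decode steps, but with static batching
# every request in the batch holds its slot until the *longest* request finishes.

def static_batch_decode(needed_steps: list[int]) -> float:
    """Return the fraction of slot-steps that do useful work."""
    total_steps = max(needed_steps)              # batch runs until the longest request is done
    useful = sum(needed_steps)                   # steps that actually produced tokens
    capacity = total_steps * len(needed_steps)   # slot-steps the GPU spent on the batch
    return useful / capacity

# E.g. one long request (512 new tokens) batched with three short ones (32 each):
print(static_batch_decode([512, 32, 32, 32]))    # ~0.30 -> roughly 70% of slots sit idle
```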
One such recently proposed optimization is continuous batching. In this talk we’ll discuss what it is, how it works, and how it enables a 23x improvement in throughput over naive HuggingFace transformers on a production workload (and 3x over the previous SOTA).
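As a preview, here is a minimal sketch of the continuous-batching idea (hypothetical names, not any particular framework's scheduler): rather than waiting for an entire batch to finish, the server re-fills freed slots from the request queue after every decode iteration.

```python
# A toy continuous-batching loop: finished requests leave immediately and new
# requests are admitted each iteration, so slots rarely sit idle.

from collections import deque

def continuous_batch_decode(queue: deque[int], max_batch: int) -> int:
    """Each queued item is the number of decode steps a request still needs.
    Returns how many decode iterations the loop ran."""
    running: list[int] = []
    iterations = 0
    while queue or running:
        # Admit new requests into any free slots before the next step.
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        # One decode step advances every running request by one token.
        running = [steps - 1 for steps in running]
        # Finished requests leave immediately, freeing their slots.
        running = [steps for steps in running if steps > 0]
        iterations += 1
    return iterations

# Prints 512: the long request completely hides the short ones.
# Two static batches ([512, 32, 32, 32] then [32, 32]) would need 512 + 32 = 544 iterations.
print(continuous_batch_decode(deque([512, 32, 32, 32, 32, 32]), max_batch=4))
```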