How continuous batching enables 23x throughput in LLM inference

Due to the large GPU memory footprint and compute cost of LLMs, serving dominates the compute cost for most real-world applications. ML engineers often treat LLMs like "black boxes" that can only be optimized with internal changes such as quantization and custom CUDA kernels. However, this is not entirely the case. Because LLMs generate their output iteratively, and because LLM inference is often memory-bound rather than compute-bound, there are surprising system-level batching optimizations that make a 10x or greater difference in real-world workloads.

One such recently proposed optimization is continuous batching. In this talk we’ll discuss what it is, how it works, and how it enables a 23x improvement in throughput over naive HuggingFace transformers on a production workload (3x over the previous SOTA).

Anyscale

August 31, 2023


Transcript

  1. About me
    • At Anyscale for the last 1.5 years, working on Ray and LLMs
    • Previously worked on a communication engine for LLM training at AWS
    • Outside of work, I enjoy a good latte while liking hot takes on ML/AI Twitter (𝕏)
  2. Goal: Show how continuous batching significantly reduces LLM serving costs
    • LLM inference background
    • Systems challenges that increase cost
    • How continuous batching makes such an improvement (23x!)
    • Benchmark results
    Note: most of this talk is covered in our blog post “How continuous batching enables 23x throughput in LLM inference while reducing p50 latency”
    Reduce serving costs → enable more LLM applications
  3. LLM inference background: How does text generation work?
    • Iterative: each forward pass generates a single token
    • Autoregressive: generation consumes prompt tokens + previously generated tokens
    • Completion potentially decided by model: a generated token can be the end-of-sequence token
    Legend: • Yellow: prompt token • Blue: generated token • Red: end-of-sequence token
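The three properties on this slide can be sketched as a small loop. This is a toy illustration only: `toy_forward`, the `EOS` id, and the 10% stop probability are hypothetical stand-ins for a real model's forward pass and are not taken from the talk.

```python
import random

EOS = 0  # hypothetical end-of-sequence token id


def toy_forward(tokens, rng):
    """Stand-in for one forward pass: returns exactly one next-token id.

    A real model would compute logits from all tokens seen so far; this toy
    picks a random id and emits EOS about 10% of the time.
    """
    return EOS if rng.random() < 0.1 else rng.randint(1, 99)


def generate(prompt, max_new_tokens, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # Autoregressive: the input is the prompt plus everything generated so far.
        next_token = toy_forward(tokens, rng)
        # Iterative: one forward pass yields one new token.
        tokens.append(next_token)
        # Completion decided by the model: stop when it emits EOS.
        if next_token == EOS:
            break
    return tokens


if __name__ == "__main__":
    print(generate([5, 6, 7], max_new_tokens=20))
```

Because the stopping point depends on what the model emits, the number of forward passes per request is not known in advance, which is what makes batching non-trivial later in the talk.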
  4. Systems challenges that increase cost
    • Size of LLM parameters >> size of LLM data
      ◦ Llama2 70B ~ 130GB to store float16 parameters
      ◦ 2x A100-80GB to store, 4x+ A100-80GB to maximize throughput
    • Memory IO is a huge factor in latency
      ◦ For a single token, have to load 130 GB to compute cores
      ◦ CPU memory IO ~= 10-50 GB/s
      ◦ GPU memory IO ~= 2000 GB/s (A100 80GB)
    • High throughput requires many FLOPS
      ◦ CPU can do real-time generation of a single sequence
      ◦ GPU can do real-time generation for many sequences
    From the FlashAttention paper: https://arxiv.org/pdf/2205.14135.pdf
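The memory IO bullets imply a simple lower bound on per-token latency: the time needed to stream all weights once at a given bandwidth. The sketch below works through that arithmetic with the slide's figures. It is a back-of-envelope estimate that assumes a single aggregate bandwidth number and ignores compute time, caching, and multi-GPU sharding; the function names are illustrative.

```python
WEIGHTS_GB = 130  # Llama2 70B in float16, figure from the slide


def seconds_per_token(bandwidth_gb_per_s, weights_gb=WEIGHTS_GB):
    """Lower bound on one forward pass: time to read every weight once."""
    return weights_gb / bandwidth_gb_per_s


def amortized_seconds_per_token(bandwidth_gb_per_s, batch_size):
    """Weights loaded once per step are shared by every sequence in the batch."""
    return seconds_per_token(bandwidth_gb_per_s) / batch_size


if __name__ == "__main__":
    print(f"GPU at 2000 GB/s: {seconds_per_token(2000) * 1000:.0f} ms per token")
    print(f"CPU at 50 GB/s:   {seconds_per_token(50):.1f} s per token")
    print(f"CPU at 10 GB/s:   {seconds_per_token(10):.1f} s per token")
    print(f"GPU, batch of 16: {amortized_seconds_per_token(2000, 16) * 1000:.1f} ms per token per sequence")
```

The last line is the reason batching matters: the cost of loading weights is paid once per step regardless of how many sequences share that step.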
  5. KV cache: transformer-specific optimization
    • Autoregressive generation recomputes constants K and V
    • Cache K,V to reduce recomputations
    • K,V are ~1MB each per token for 13B model
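A rough way to sanity-check the order of magnitude of the per-token KV cache figure is to count the cached values directly. The sketch below assumes a 13B-class decoder with 40 layers, a hidden size of 5120, and float16 values; those dimensions are an assumption for illustration and are not stated on the slide.

```python
def kv_cache_bytes_per_token(num_layers, hidden_size, bytes_per_value=2):
    """Bytes of K and V state cached for one token.

    Each layer stores one K vector and one V vector of length hidden_size.
    """
    return 2 * num_layers * hidden_size * bytes_per_value


if __name__ == "__main__":
    # Assumed dimensions for a 13B-class model (not taken from the talk).
    per_token = kv_cache_bytes_per_token(num_layers=40, hidden_size=5120)
    print(f"{per_token} bytes per token (~{per_token / 1e6:.2f} MB for K and V combined)")
```

Under these assumptions K and V together come to roughly 0.8 MB per token, the same order of magnitude as the rounded figure on the slide. The cache therefore grows linearly with every prompt and generated token held in the batch.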
  6. Other optimizations
    • Quantization – compress parameters but reduce model quality
      ◦ Treats model like black-box
    • Custom CUDA kernels – e.g. FlashAttention, reduces memory IO needed
      ◦ Low-level, complicated
    • Grouped Query Attention (GQA) – modify model architecture for optimized inference
      ◦ Requires changes to training
    • Continuous Batching – modify how sequences are batched
      ◦ Works with any LLM!
  7. Static batching
    • Batching multiple sequences on GPU, aka “static batching”
    • Problem: GPU utilization drops as sequences complete
    Legend: • Yellow: prompt token • Blue: generated token • Red: end-of-sequence token
  8. Continuous batching
    Top: static batching
    Bottom: continuous batching
    Legend: • Yellow: prompt token • Blue: generated token • Red: end-of-sequence token
  9. Continuous batching
    • Continuous batching dynamically recreates batches
    • Fills GPU capacity after each token generation
    • As variance in sequence length increases, the GPU utilization gain from continuous batching increases
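The difference between the two scheduling policies can be sketched with a small simulation that counts decoding steps. This is a simplified model, not the scheduler of any framework named in the talk: it assumes a fixed number of batch slots, one token per sequence per step, output lengths of at least one token, and no prefill cost. All function names are illustrative.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decoding step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each running sequence
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every running sequence emits one token; finished ones leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


def utilization(lengths, batch_size, steps):
    """Fraction of slot-steps that produced a token."""
    return sum(lengths) / (batch_size * steps)


if __name__ == "__main__":
    for name, lengths in [("high variance", [1, 8, 2, 8, 1, 2, 1, 1]),
                          ("no variance", [4] * 8)]:
        s = static_batching_steps(lengths, 4)
        c = continuous_batching_steps(lengths, 4)
        print(f"{name}: static {s} steps ({utilization(lengths, 4, s):.0%}), "
              f"continuous {c} steps ({utilization(lengths, 4, c):.0%})")
```

With identical lengths the two policies take the same number of steps; the gap only opens up when some sequences finish well before others in the same batch.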
  10. Throughput experiments
    • Hypothesis
      ◦ Continuous batching performs better the more variance there is in sequence lengths
    • Frameworks
    • Setup – hardware/model
    • Setup – data
    • Results
  11. Throughput experiments: Frameworks
    Static batching
    • HuggingFace Pipelines
    • NVIDIA FasterTransformer
    Continuous batching
    • HuggingFace text-generation-inference (TGI)
    • Ray Serve
    • vLLM
  12. Throughput experiments: Hardware/model
    • 1x NVIDIA A100-40GB SXM GPU
    • Provided by Anyscale
    • Meta’s OPT-13B
      ◦ dtype=float16 → 26GB for parameters
    • No tensor parallelism
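The 26GB figure follows from 13 billion parameters at 2 bytes each, and the remainder of the 40GB card bounds how many tokens of KV cache can be resident at once. The sketch below works through that budget. It assumes roughly 1 MB of KV cache state per token in total and ignores activations and framework overhead, so the result is an optimistic estimate rather than a measured limit.

```python
GB = 10**9
MB = 10**6


def weight_bytes(num_params, bytes_per_param=2):
    """Memory needed for the parameters alone (float16 by default)."""
    return num_params * bytes_per_param


def kv_token_budget(gpu_bytes, num_params, kv_bytes_per_token=1 * MB):
    """Tokens of KV cache that fit in the memory left over after the weights.

    kv_bytes_per_token is an assumed round number, not a measured value.
    """
    free_bytes = gpu_bytes - weight_bytes(num_params)
    return free_bytes // kv_bytes_per_token


if __name__ == "__main__":
    params = 13 * 10**9  # OPT-13B
    print(f"weights: {weight_bytes(params) // GB} GB")
    budget = kv_token_budget(40 * GB, params)
    print(f"KV cache budget: ~{budget} tokens")
    # 512 prompt tokens plus up to 1536 generated tokens per sequence.
    print(f"full-length sequences that fit: {budget // (512 + 1536)}")
```

Under these assumptions only a handful of maximum-length sequences fit at once, which is why how the batch slots are filled has such a large effect on throughput.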
  13. Throughput experiments: Data
    • Hypothesis
      ◦ Continuous batching performs better the more variance there is in sequence lengths
    • How to test?
      ◦ Generate 1000 prompts each with 512 input tokens
      ◦ Generate predetermined output length for each prompt, following an exponential distribution
      ◦ Configure model to ignore EOS token
    • How to control variance in sequence lengths?
      ◦ Limit the random sequence lengths artificially
      ◦ E.g. to 32, 128, 512, and 1536 output tokens
      ◦ 4 experiments
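The workload described above can be sketched as follows: draw an output length per prompt from an exponential distribution, then cap it at the experiment's limit. The mean of 128 tokens and the fixed seed are assumptions made for this sketch, since the slide does not state the distribution's parameters, and `sample_output_lengths` is an illustrative name.

```python
import random
import statistics


def sample_output_lengths(num_prompts, mean_tokens, max_tokens, seed=0):
    """Predetermined output length per prompt: exponential, capped at max_tokens."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(num_prompts):
        # expovariate takes the rate (1 / mean); add 1 so every prompt yields a token.
        length = int(rng.expovariate(1.0 / mean_tokens)) + 1
        lengths.append(min(length, max_tokens))
    return lengths


if __name__ == "__main__":
    for cap in (32, 128, 512, 1536):
        lengths = sample_output_lengths(1000, mean_tokens=128, max_tokens=cap)
        print(f"cap={cap:>4}: mean={statistics.mean(lengths):7.1f} "
              f"variance={statistics.pvariance(lengths):9.1f}")
```

A low cap forces most sequences to the same length, so variance is small; raising the cap lets the long tail of the distribution through, which is the knob the four experiments turn.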
  14. How does vLLM beat TGI?
    • Note: we ran the experiments in June; TGI is now much closer to vLLM
    • TGI and vLLM both use continuous batching
    • vLLM uses PagedAttention, which leaves extra memory space for larger batch sizes