
vLLM meetup Tokyo


jpishikawa

June 17, 2025

Transcript

  1. Agenda
     18:00–18:20  Opening Remarks (Brian Stevens, Red Hat)
     18:20–18:45  Intro to vLLM (Michael Goin, Red Hat)
     18:45–19:15  Deploying PLaMo2 with vLLM: A Practical Guide (Shinichi Hemmi, Preferred Networks)
     19:15–19:35  LLM Compressor (Michael Goin, Red Hat)
     19:35–19:55  llm-d (Huamin Chen, Red Hat)
     19:55–20:00  Q&A
     20:00–21:00  Networking & Lightning Talks (LT)
  2. Vision: at Red Hat we believe the future of AI is open – source, models, infrastructure, serving.
     Mission: we are on a mission to bring the power of open-source LLMs and vLLM to every enterprise on the planet.
  3. The world changed in November 2022: ChatGPT woke the world up to the power of generative AI.
  4. The power of open: there has been an explosion of capability in open models over the last two years.
     [Timeline, Jan 2023 – Jan 2025: from "no OSS models" through Llama, RedPajama, MPT, Falcon, Llama 2, Mistral, Zephyr, Mixtral, Phi-2, Granite 2, DBRX, Phi-3, Arctic, Llama 3, Qwen2, Gemma2, Nemotron, Granite 3, Qwen2-VL, to DeepSeek-R1.]
  5. The power of open: open models are deployment targets today – and the trend is not slowing down.
     • Llama: 650M downloads in 2024; 85,000 Llama derivative models; 1B, 3B, 8B, 70B, 405B variants; multilingual, multimodal, mobile.
     • R1: first reasoning model on par in quality with OpenAI o1; 1B–70B parameter distilled versions.
     • Headlines: global market pandemonium?
     Models are commoditizing → many options for diverse enterprise needs.
  6. Advantages of open-weight models and serving stack: open models play an important role in the enterprise AI landscape.
     • Cost: self-managed infrastructure; 1B–405B sizes let you match task difficulty to model size.
     • Customization: improve accuracy and costs with task-specific tuning.
     • Control: model lifecycle (no in-place changes to the model); resources (no rate limits / API downtime).
     • Security: complete data privacy (no 3rd-party APIs).
  7. Red Hat: leaders in OSS GenAI inference, with expertise across high-performance inference and SOTA model optimizations.
     • HPC engineering team dedicated to vLLM, with 7 core vLLM committers on staff
     • Work on key subsystems, with a particular emphasis on fast model execution
     • ML engineering team builds vLLM's optimization library, llm-compressor
     • ML research team creates pre-optimized models for deployment with vLLM
     [Chart: core developers of vLLM – Red Hat community contribution, commits by organization.]
  8. Red Hat: leaders in OSS GenAI inference, with expertise across high-performance inference and SOTA model optimizations.
     • Optimized model hub and LLM compression tools covering Llama, Qwen, Mistral, DeepSeek, Gemma, and Phi.
  9. vLLM: the de facto open GenAI inference platform. vLLM has emerged as the Linux of GenAI inference.
     [Diagram: models (Llama, Qwen, DeepSeek, Gemma, Mistral, Molmo, Phi, Nemotron, Granite) served on hardware (GPU, Instinct, Gaudi, TPU, Neuron, Spyre) across edge, private cloud (physical and virtual), and public cloud.]
  10. llm-d: Kubernetes distributed inference at scale.
      Why? A distributed architecture is needed for maximum efficiency and for meeting varying SLOs.
      Core features:
      • Prefill/decode disaggregation
      • KV cache distribution, offloading, and storage hierarchy
      • AI-aware router
      • Operational telemetry for production
      • Kubernetes-based
      • NIXL inference transfer library
      Open source release and announcement at the May Red Hat Summit, together with 10 founding members.
  11. Intro to vLLM. Michael Goin, Principal Software Engineer, Red Hat; vLLM maintainer.
  12. vLLM's goal: build the fastest and easiest-to-use open-source LLM inference & serving engine.
  13. What problem is vLLM solving? Production inference serving.
      ▸ Batch size > 1 and data center hardware
      ・ Not the same workload as on-device inference for a single user
      ▸ How do you:
      ・ Efficiently schedule requests into the next forward pass?
      ・ Manage the KV cache context and runtime memory footprint?
      ・ Make sure GPUs go brrrr?
  14. Why is this a hard problem?
      ▸ An LLM is a function that predicts the next token in a sequence
      ・ P(X_n | X_0, …, X_{n-1})
      ▸ To generate text, we "chain together" passes through the model
      ・ → A single request requires multiple passes through the model
      ・ → A single generation request can last multiple seconds
      ▸ Key challenge: how to handle multiple concurrent requests
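     A minimal sketch of that chained generation loop (the model and tokenizer objects here are hypothetical stand-ins, not vLLM APIs), showing why one request turns into many forward passes:

       # Greedy autoregressive decoding: one forward pass per generated token.
       def generate(model, tokenizer, prompt, max_new_tokens=32, eos_id=2):
           tokens = tokenizer.encode(prompt)
           for _ in range(max_new_tokens):
               logits = model.forward(tokens)          # full pass over the sequence so far
               next_token = int(logits[-1].argmax())   # argmax of P(X_n | X_0, ..., X_{n-1})
               tokens.append(next_token)
               if next_token == eos_id:                # stop on end-of-sequence
                   break
           return tokenizer.decode(tokens)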
  15. Challenge 1: Batching. Static batching 🙅🙅🙅 vs. continuous batching 🙏🙏🙏
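     A toy illustration of the difference (not vLLM's scheduler): with continuous batching, new requests join the running batch at every step and finished requests leave immediately, instead of waiting for the whole static batch to drain. The step_fn callback is a hypothetical stand-in for one forward pass.

       import collections

       def continuous_batching(incoming, step_fn, max_batch=8):
           waiting = collections.deque(incoming)      # requests not yet scheduled
           running, finished = [], []
           while waiting or running:
               while waiting and len(running) < max_batch:
                   running.append(waiting.popleft())  # admit new requests every iteration
               done, running = step_fn(running)       # one forward pass over the whole batch
               finished.extend(done)                  # completed requests return immediately
           return finished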
  16. Challenge 2: KV Caching. Caching the key and value vectors in self-attention saves redundant computation and accelerates decoding.
  17. Challenge 2: KV Caching. Caching the key and value vectors in self-attention saves redundant computation and accelerates decoding - but takes up memory!
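     A rough sizing sketch (illustrative numbers approximating a Llama-3.1-8B-style model, not figures from the slides): the KV cache needs 2 × num_layers × num_kv_heads × head_dim × bytes_per_element per token.

       def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
           # factor of 2: one key vector and one value vector per layer
           return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

       per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
       print(per_token / 1024, "KiB per token")                          # 128.0 KiB in BF16
       print(per_token * 8192 / 1024**3, "GiB for an 8k-token context")  # 1.0 GiB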
  18. vLLM's original innovation: PagedAttention + continuous batching.
  19. vLLM's original innovation: PagedAttention + continuous batching.
      [Diagram: each request's logical KV blocks ("Alan Turing is a computer scientist and mathematician renowned…" for Request A, "Artificial Intelligence is the future of technology" for Request B) are mapped through per-request block tables onto non-contiguous physical KV blocks.]
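     A toy block table in the spirit of PagedAttention (not vLLM's implementation): each request's logical KV blocks map onto arbitrary physical blocks drawn from a shared pool, so memory is allocated in fixed-size pages rather than one contiguous slab per request.

       BLOCK_SIZE = 16  # tokens per KV block

       class BlockTable:
           def __init__(self, free_blocks):
               self.free = free_blocks            # shared pool of physical block ids
               self.logical_to_physical = []      # index = logical block, value = physical block

           def append_token(self, token_index):
               if token_index % BLOCK_SIZE == 0:  # previous block is full, grab a new page
                   self.logical_to_physical.append(self.free.pop())
               return self.logical_to_physical[token_index // BLOCK_SIZE]

       pool = list(range(1024))
       table = BlockTable(pool)
       print([table.append_token(i) for i in range(40)][::16])  # physical blocks for tokens 0, 16, 32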
  20. The 2-year journey of vLLM: vLLM has rapidly evolved from a research project to the open-source default.
      ▸ Pervasive → 100k daily installs in 2025; 50k GitHub stars
      ▸ Explosive growth → 10x usage increase in 2024
      ▸ Vibrant community → 1000+ contributors
  21. Who uses vLLM? vLLM is the de facto OSS inference server, with ~600k weekly installs and ~50k GitHub stars.
      • Model as a Service: AWS, GCP, Azure, NVIDIA, …
      • AI in scaled production: Amazon, Microsoft, LinkedIn, Meta, …
      • Proprietary deployments: Snowflake, Roblox, IBM, …
      • Foundation model labs: Meta, Mistral, Qwen, Cohere, …
      • Fine-tuning frameworks: veRL, TRL, OpenRLHF, …
      • Hardware platforms: NVIDIA, AMD, Google, Intel, ARM, …
  22. Why vLLM for performance? vLLM implements the key optimizations for fast inference:
      • Inference optimizations, to make your models faster
      • Distributed inference, to deploy large models efficiently
  23. Quantization in vLLM: use low-bit precisions (e.g., FP8, INT8, FP4) to store and compute.
      1. Weight quantization: reduced storage and memory footprints (e.g., a 100B model is 200 GB in BFloat16 vs. 100 GB in FP8)
      2. Activation quantization: faster linear layers and compute speedups
      3. KV cache quantization: reduced KV cache footprint and faster attention, crucial for long-context workloads
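     The storage arithmetic from the slide, written out (a small illustrative helper, not part of vLLM):

       BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

       def weight_gigabytes(num_params, dtype):
           return num_params * BYTES_PER_PARAM[dtype] / 1e9

       for dtype in ("bf16", "fp8", "int4"):
           print(dtype, weight_gigabytes(100e9, dtype), "GB")   # 200.0, 100.0, 50.0 GB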
  24. Impact of quantization: quantization enables more tokens for fixed hardware.
  25. Automatic prefix caching: re-use KV cache blocks across requests!
      Request A: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: Hello!"
      Request B: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: How are you?"
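     A hedged sketch of turning this on with the LLM class (automatic prefix caching may already be enabled by default in recent vLLM versions): both prompts share the long system-prompt prefix, so the second request can reuse the first request's cached KV blocks.

       from vllm import LLM

       system = ("A chat between a curious user and an artificial intelligence assistant. "
                 "The assistant gives helpful, detailed, and polite answers to the user's questions. ")
       llm = LLM(model="meta-llama/Meta-Llama-3.1-8B", enable_prefix_caching=True)
       outputs = llm.generate([system + "User: Hello!", system + "User: How are you?"])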
  26. Impact of automatic prefix caching: prefix caching improves time-to-first-token by skipping prefill.
  27. Speculative decoding: accelerate the decode phase with speculation; a variety of methods exist.
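     A conceptual sketch of the draft-and-verify idea behind speculative decoding (greedy acceptance only; the draft_model/target_model methods are hypothetical, and this is not vLLM's implementation):

       def speculative_step(draft_model, target_model, tokens, k=4):
           proposal = draft_model.propose(tokens, k)                # k cheap draft tokens
           # One target-model pass scores all k+1 positions and returns its greedy choice at each.
           verified = target_model.greedy_choices(tokens, proposal)
           accepted = []
           for drafted, wanted in zip(proposal, verified):
               accepted.append(wanted)              # the target's token is always safe to keep
               if drafted != wanted:                # first disagreement: stop accepting drafts
                   return tokens + accepted
           accepted.append(verified[-1])            # all drafts accepted: bonus token for free
           return tokens + accepted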
  28. Impact of speculative decoding: speculative decoding enables better latency in bandwidth-bound regimes.
  29. vLLM combines all optimizations together. [Chart: performance with vs. without optimizations.]
  30. vLLM API (1): the LLM class, a Python interface for offline batched inference.
      from vllm import LLM

      # Example prompts.
      prompts = ["Hello, my name is", "The capital of France is"]

      # Create an LLM with an HF model name.
      llm = LLM(model="meta-llama/Meta-Llama-3.1-8B")

      # Generate texts from the prompts.
      outputs = llm.generate(prompts)  # also llm.chat(messages)
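     The slide's example uses default sampling; generation is typically controlled with SamplingParams (a short extension of the example above):

       from vllm import LLM, SamplingParams

       llm = LLM(model="meta-llama/Meta-Llama-3.1-8B")
       params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
       for output in llm.generate(["Hello, my name is"], params):
           print(output.outputs[0].text)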
  31. vLLM API (2): OpenAI-compatible server, a FastAPI-based server for online serving.
      Server:
      $ vllm serve meta-llama/Meta-Llama-3.1-8B
      Client:
      $ curl http://localhost:8000/v1/completions \
          -H "Content-Type: application/json" \
          -d '{
            "model": "meta-llama/Meta-Llama-3.1-8B",
            "prompt": "San Francisco is a",
            "max_tokens": 7,
            "temperature": 0
          }'
      Available routes include: /tokenize, /detokenize, /v1/models, /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/audio/transcriptions, /pooling, /classify, /score, /rerank, /metrics, and more.
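     Because the routes are OpenAI-compatible, the official openai Python client can talk to the server directly (a sketch; any placeholder api_key works unless the server was started with an API key):

       from openai import OpenAI

       client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
       resp = client.completions.create(
           model="meta-llama/Meta-Llama-3.1-8B",
           prompt="San Francisco is a",
           max_tokens=7,
           temperature=0,
       )
       print(resp.choices[0].text)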
  32. vLLM API (3): the embeddable LLMEngine, a Python library with the full power of vLLM in your framework.
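     A minimal sketch of embedding the engine in your own loop (API names follow vLLM's LLMEngine; exact signatures vary between vLLM versions):

       from vllm import EngineArgs, LLMEngine, SamplingParams

       engine = LLMEngine.from_engine_args(EngineArgs(model="meta-llama/Meta-Llama-3.1-8B"))
       engine.add_request("request-0", "The capital of France is", SamplingParams(max_tokens=8))

       while engine.has_unfinished_requests():
           for output in engine.step():        # one scheduling + forward-pass iteration
               if output.finished:
                   print(output.outputs[0].text)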
  33. Get started today: $ pip install vllm
      https://github.com/vllm-project/vllm ・ https://docs.vllm.ai/ ・ https://twitter.com/vllm_project ・ https://opencollective.com/vllm ・ https://slack.vllm.ai
      Roadmap
      Q2: 🚧 V1 migration; 🚧 large-scale serving; ✅ post-training (RLHF); ✅ performance enhancement on various hardware
      Q3: 🌐 advance the next frontier: NVL72 rack-scale serving; 🏎 specialization: low latency, high throughput, etc.; 🧱 operability: observability, scaling, customization
  34. LLM Compressor. Michael Goin, Principal Software Engineer, Red Hat; vLLM maintainer.
  35. LLM Compressor: compressing your LLMs for optimized deployment with vLLM. https://github.com/vllm-project/llm-compressor
  36. Numerics 101
      ▸ LLMs are a series of matrix multiplications of learned "parameters"
      ▸ Each "parameter" is a value represented by some number of bits
      ▸ The more bits used to represent a parameter, the more detail we can express:
      ・ Dynamic range (min–max value)
      ・ Precision (values close to zero)
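     To make the range/precision trade-off concrete, PyTorch's finfo reports both per dtype (illustration only; the float8 dtypes require a recent PyTorch build):

       import torch

       for dtype in (torch.bfloat16, torch.float16, torch.float8_e4m3fn):
           info = torch.finfo(dtype)
           print(f"{str(dtype):22s} bits={info.bits:2d}  max={info.max:<12.4g}  eps={info.eps:.4g}")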
  37. Quantization 101: quantization aims to reduce the precision of a model's weights (and possibly activations) from high-precision formats (e.g., BF16 training) to low-precision formats (e.g., INT8 / FP8) without dropping model quality. [Diagram: quantization targets.]
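     The core operation in its simplest symmetric, per-tensor form (a toy sketch; real schemes such as those in LLM Compressor use per-channel or per-group scales plus calibration):

       import numpy as np

       def quantize_int8(x):
           scale = np.abs(x).max() / 127.0                       # map the largest magnitude to ±127
           q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
           return q, scale

       def dequantize(q, scale):
           return q.astype(np.float32) * scale

       w = np.random.randn(4, 4).astype(np.float32)
       q, s = quantize_int8(w)
       print("max abs error:", np.abs(w - dequantize(q, s)).max())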
  38. Quantization in vLLM: use low-bit precisions (e.g., FP8, INT8, FP4) to store and compute.
      1. Weight quantization: reduced storage and memory footprints (e.g., a 100B model is 200 GB in BFloat16 vs. 100 GB in FP8)
      2. Activation quantization: faster linear layers and compute speedups
      3. KV cache quantization: reduced KV cache footprint and faster attention, crucial for long-context workloads
  39. Weight quantization (W8A16, W4A16, WNA16)
      ▸ Reduce GPU RAM requirements by squeezing the parameters into lower bit precisions
      ▸ Reduce data movement at the expense of some extra compute to upconvert on each forward pass
      ▸ This is likely what you already know: GPTQ, AWQ, bitsandbytes, GGUF, etc.
  40. Activation quantization (W8A8, W4A8, W4A4)
      ▸ Weight-only quantization is not sufficient for speedups once under load (long prefills or batching)
      ▸ Quantizing both weights and activations means we can finally use low-precision tensor cores!
  41. Why quantize weights and activations? Quantization enables more tokens for fixed hardware.
  42. Accurate compression with fine-grained quantization: not all quantization is the same; quality is important! [Chart: Pass@1 score and standard deviation for quantized models on popular reasoning benchmarks.]
  43. Get started with quantization in vLLM: find pre-optimized models at hf.co/RedHatAI.
      • Pre-optimized model hub (Llama, Qwen, Mistral, DeepSeek, Gemma, Phi) → red.ht/optimized-models
      • LLM Compressor → red.ht/llm-compressor
  44. Quantizing a model with LLM Compressor: picking the right scheme and model.
      Picking a compression scheme:
      ▸ W4A16 (low batch size, memory bound)
      ▸ INT8 or FP8 (high batch size, compute bound)
      ▸ KV cache quantization (large context lengths)
      ▸ 2of4 sparsity (high batch size, smaller model size)
  45. Applying algorithms with LLM Compressor: defining a recipe.
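     The recipe on the slide was shown as a screenshot; a hedged sketch in the style of the llm-compressor examples (import paths and arguments may differ between releases, and the model/dataset names here are illustrative):

       from llmcompressor.modifiers.quantization import GPTQModifier
       from llmcompressor.transformers import oneshot

       # W4A16 GPTQ recipe: quantize every Linear layer except the LM head.
       recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

       oneshot(
           model="meta-llama/Meta-Llama-3.1-8B-Instruct",
           dataset="open_platypus",                 # calibration data
           recipe=recipe,
           max_seq_length=2048,
           num_calibration_samples=512,
           output_dir="Meta-Llama-3.1-8B-Instruct-W4A16",
       )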
  46. Applying algorithms with LLM Compressor: compressing the model.
      Compressing… (1/29): Calibrating: 100%|██████████| 512/512 [00:34<00:00, 14.69it/s]
      2025-02-11T23:14:03.464012-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.q_proj using 512 samples
      2025-02-11T23:14:06.270327-0500 | compress | METRIC - time 2.81s
      2025-02-11T23:14:06.270540-0500 | compress | METRIC - error 1197.95
      2025-02-11T23:14:06.271204-0500 | compress | METRIC - GPU 0 | usage: 11.20% | total memory: 85 GB
      2025-02-11T23:14:06.271554-0500 | compress | METRIC - Compressed module size: 4.77696 MB
      2025-02-11T23:14:06.271732-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.k_proj using 512 samples
      2025-02-11T23:14:06.921636-0500 | compress | METRIC - time 0.65s
      2025-02-11T23:14:06.921845-0500 | compress | METRIC - error 221.82
      2025-02-11T23:14:06.922063-0500 | compress | METRIC - GPU 0 | usage: 11.20% | total memory: 85 GB
      2025-02-11T23:14:06.922356-0500 | compress | METRIC - Compressed module size: 0.79616 MB
      2025-02-11T23:14:06.922503-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.v_proj using 512 samples
      2025-02-11T23:14:07.571902-0500 | compress | METRIC - time 0.65s
      2025-02-11T23:14:07.572131-0500 | compress | METRIC - error 28.66
      2025-02-11T23:14:07.572351-0500 | compress | METRIC - GPU 0 | usage: 11.20% | total memory: 85 GB
      2025-02-11T23:14:07.572643-0500 | compress | METRIC - Compressed module size: 0.79616 MB
      2025-02-11T23:14:07.572780-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.o_proj using 512 samples
      2025-02-11T23:14:08.230037-0500 | compress | METRIC - time 0.66s
      2025-02-11T23:14:08.230248-0500 | compress | METRIC - error 11.95
      2025-02-11T23:14:10.214038-0500 | compress | METRIC - GPU 0 | usage: 11.20% | total memory: 85 GB
      2025-02-11T23:14:10.214890-0500 | compress | METRIC - Compressed module size: 4.773888 MB
      2025-02-11T23:14:10.215273-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.0.mlp.gate_proj using 512 samples
      2025-02-11T23:14:10.959970-0500 | compress | METRIC - time 0.74s
      2025-02-11T23:14:10.960274-0500 | compress | METRIC - error 1085.49
      2025-02-11T23:14:10.960643-0500 | compress | METRIC - GPU 0 | usage: 11.20% | total memory: 85 GB
      2025-02-11T23:14:10.960939-0500 | compress | METRIC - Compressed module size: 27.84768 MB
      2025-02-11T23:14:10.961100-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.0.mlp.up_proj using 512 samples
      2025-02-11T23:14:11.674930-0500 | compress | METRIC - time 0.71s
      2025-02-11T23:14:11.675241-0500 | compress | METRIC - error 708.51
      2025-02-11T23:14:11.675612-0500 | compress | METRIC - GPU 0 | usage: 11.20% | total memory: 85 GB
      2025-02-11T23:14:11.675932-0500 | compress | METRIC - Compressed module size: 27.84768 MB
      2025-02-11T23:14:11.676104-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.0.mlp.down_proj using 512 samples
      2025-02-11T23:14:15.634541-0500 | compress | METRIC - time 3.96s
      2025-02-11T23:14:15.635367-0500 | compress | METRIC - error 22.47
      2025-02-11T23:14:15.635735-0500 | compress | METRIC - GPU 0 | usage: 11.95% | total memory: 85 GB
      2025-02-11T23:14:15.636034-0500 | compress | METRIC - Compressed module size: 27.84768 MB
      (1/29): Propagating: 100%|██████████| 512/512 [00:32<00:00, 15.81it/s]
      (2/29): Calibrating: 100%|██████████| 512/512 [00:03<00:00, 147.09it/s]
      2025-02-11T23:14:51.507693-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.1.self_attn.q_proj using 512 samples
      2025-02-11T23:14:52.171517-0500 | compress | METRIC - time 0.66s
      2025-02-11T23:14:52.171817-0500 | compress | METRIC - error 281.88
      2025-02-11T23:14:52.804425-0500 | compress | METRIC - GPU 0 | usage: 11.95% | total memory: 85 GB
      2025-02-11T23:14:52.805252-0500 | compress | METRIC - Compressed module size: 4.77696 MB
      2025-02-11T23:14:52.805484-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.1.self_attn.k_proj using 512 samples
  47. Deploying to vLLM: native support via compressed tensors (packing + bitmasks).
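     Deploying the result is then an ordinary vLLM model load (a sketch; the directory name matches the hypothetical output_dir above, and vLLM reads the compressed-tensors config from the checkpoint):

       from vllm import LLM

       llm = LLM(model="Meta-Llama-3.1-8B-Instruct-W4A16")
       print(llm.generate(["The capital of France is"])[0].outputs[0].text)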
  48. Composing algorithms with LLM Compressor: SmoothQuant and SparseGPT.
  49. The LLM Compressor ecosystem: LLM Compressor and compressed tensors.
      • Compressed-tensors integration with Transformers
      • Research-backed quantized models, ready to deploy
      • Integrations with SFT frameworks like Axolotl
      • Adopted by foundation model labs!
  50. Demo: compressed model inference acceleration in vLLM.
      ▸ 2x end-to-end latency speedup for the Llama 3.1 70B model
      ▸ 50% fewer GPUs
      ・ Dense (left) running on two A100 80GB GPUs
      ・ Compressed (right) running on one A100 80GB GPU
      ▸ 99.14% accuracy recovery
  51. llm-d: Kubernetes-native distributed inference at scale. Huamin Chen, Ph.D., Distinguished Engineer, Red Hat.
  52. llm-d overview: the problems to solve.
      ▸ Kubernetes-native distributed inference serving stack
      ▸ Optimal performance per dollar across hardware accelerators
      ▸ Advanced optimizations beyond traditional load balancing:
      ・ Prefix caching
      ・ Disaggregated serving
      ・ Intelligent routing
      ▸ Key differentiator: leverages unique LLM inference characteristics for 3x better performance
  53. The problem: LLM inference is different.
      Traditional HTTP                   LLM inference
      Short-lived, uniform requests      Expensive requests with high variance
      Uniform latency requirements       Diverse QoS needs (ms to hours)
      Simple round-robin works           Cache locality matters
      Each replica is equal              Disaggregation opportunities
  54. LLM inference challenges
      ▸ Request variance: input/output token counts create load imbalances
      ▸ Cache locality: multi-turn conversations and RAG benefit from prefix caching
      ▸ Resource optimization: prefill vs. decode have different requirements
      ▸ QoS diversity: code completion (ms) vs. batch processing (hours)
  55. llm-d innovations
      vLLM-optimized inference scheduler:
      ▸ Prefix-cache-aware load balancing
      ▸ KV cache utilization awareness
      ▸ Session affinity for multi-turn conversations
      Disaggregated serving:
      ▸ Prefill: compute-intensive, parallelizable
      ▸ Decode: memory-bandwidth-bound, latency-sensitive
      ▸ Independent scaling and optimization
      Hierarchical prefix caching:
      ▸ Multi-tier: local HBM, host memory, remote storage
      ▸ Cross-instance KV transfer capabilities
      Variant autoscaling (roadmap):
      ▸ Traffic- and hardware-aware
      ▸ Workload-specific QoS optimization
  56. Component deep dive: Deployer
      ▸ Repository: llm-d-deployer
      ▸ Purpose: single Helm chart installation (./llmd-installer.sh)
      ▸ Key features:
      ・ One-command deployment
      ・ Configurable feature toggles
      ・ Built-in metrics (Prometheus/Grafana)
      ・ Development & production configs
  57. Component deep dive: Inference Scheduler
      ▸ Built on Envoy + the Gateway API Inference Extension
      ▸ Pluggable components:
      ・ Filters: model compatibility, resource limits, health
      ・ Scorers: session affinity, prefix cache hits, load balancing
      ・ Scrapers: memory usage, active sessions, cache stats
  58. Inference Scheduler: available scorers
      Scorer            Purpose                                   Benefit
      Session-aware     Prefers pods from the same user session   Conversation continuity
      Prefix-aware      Routes based on prompt prefix matching    Cache hit optimization
      KV cache-aware    Optimizes for KV cache reuse              Memory efficiency
      Load-aware        Avoids overloaded pods                    Even distribution
  59. Component deep dive: KV Cache Manager
      ▸ Global KV cache state management
      ▸ Core components:
      ・ kvcache.Indexer: main orchestrator
      ・ LRU prefix store: tokenized prefix storage
      ・ KVBlock-to-pod index: cache location mapping
      ・ Tokenizers pool: multi-model tokenization
  60. Component deep dive: Model Service
      ▸ Prefill and decode deployments
      ▸ Inference pool and model defined by the Gateway API Inference Extension (GIE)
      ▸ Endpoint picker (EPP) deployment and service
      ▸ Relevant RBAC permissions
  61. Component deep dive: disaggregated prefill/decode
      ▸ Separation of compute phases
      ▸ Benefits:
      ・ Flexibility: per-request optimization
      ・ Resource efficiency: specialized workers
      ・ Scalability: independent scaling
  62. Supporting components
      ▸ Inference simulator:
      ・ OpenAI-compatible API endpoints with configurable response timing
      ・ Development & testing without GPUs
      ▸ Benchmarking suite:
      ・ Comprehensive performance validation
      ・ Regression prevention
      ・ Load testing capabilities
  63. Use cases & applications
      ▸ Multi-turn conversation
      ・ Challenge: redundant computation over conversation history
      ・ Solution: session-aware routing with cached context
      ・ Result: reduced TTFT for subsequent turns
      ▸ RAG (retrieval-augmented generation)
      ・ Challenge: long prompts with retrieved documents
      ・ Solution: prefix-aware routing leverages cached embeddings
      ・ Result: faster responses for knowledge-intensive tasks
      ▸ Agentic computing
      ・ Challenge: iterative patterns with shared context
      ・ Solution: combined session affinity + prefix caching
      ・ Result: reduced latency for reasoning chains
  64. Use cases & applications (cont'd)
      ▸ Code completion
      ・ Challenge: ultra-low latency with shared codebase context
      ・ Solution: KV cache-aware routing to relevant workers
      ・ Result: sub-second interactive coding responses
      ▸ Batch processing
      ・ Challenge: cost optimization for latency-tolerant workloads
      ・ Solution: variant autoscaling optimizes resource utilization
      ・ Result: lower costs while meeting SLAs
  65. Implementation roadmap
      Phase 1: Core infrastructure
      ▸ Inference Gateway integration
      ▸ Basic prefix- and load-aware routing
      ▸ Disaggregated P/D serving prototype
      ▸ KV cache manager foundation
      Phase 2: Advanced optimizations
      ▸ Enhanced KV cache hierarchy
      ▸ Improved disaggregation protocols
      ▸ Cross-accelerator support (TPU, AMD, Intel)
      ▸ Advanced metrics and observability
      Phase 3: Production hardening
      ▸ Variant autoscaling
      ▸ Multi-model support
      ▸ Advanced security features
      ▸ Enterprise integration patterns
  66. Getting started
      Quick installation:
      ▸ git clone https://github.com/llm-d/llm-d-deployer.git
      ▸ cd llm-d-deployer/quickstart
      ▸ ./llmd-installer.sh
      Configurations:
      ▸ Choose the LLM model
      ▸ Choose the router algorithms
      ▸ Choose the P/D replicas
  67. [Screenshot of the running deployment, annotated: Envoy gateway, inference router, vLLM with Qwen3 0.6B, endpoint picker for Qwen3, and model service for Qwen3.]
  68. Benchmarking setup: 6x llm-d (vLLM) nodes running meta-llama/Llama-3.1-70B-Instruct with TP=2 on NVIDIA A100-80GB.
  69. Benchmarking: IGW and vLLM in llm-d collaborated on prefix-cache-aware routing, building on IGW's KV cache-aware load balancing. Evaluated on 2×8×H100 nodes using LMbenchmark with long-input/short-output workloads. Focus: stress KV cache reuse and test routing decision quality.
  70. Community & governance
      ▸ Key contributors: CoreWeave, Google Cloud, IBM Research, NVIDIA, Red Hat
      ▸ Communication: 💬 Slack workspace, 💭 GitHub Discussions, 📧 Google Group
      ▸ Open development: Apache 2.0, upstream-first, component-based ownership
  71. Get involved with the vLLM community
      • Contribute to key vLLM features: comment on and review PRs that interest you, join the discussion on RFCs, and check out the "good first issue" tags.
      • Join the vLLM Developer Slack: ask questions and engage with us via Slack. Join here.
      • Engage with vLLM Office Hours: Red Hat hosts bi-weekly vLLM Office Hours every other Thursday. We share project updates, dig into exciting topics, answer questions, and more. All sessions are recorded; you can engage with the slides and recordings here. We are exploring ways to bring this to your region at appropriate times toward the end of the year.