
vLLM meetup Tokyo


jpishikawa

June 17, 2025

Transcript

  1. Agenda
     18:00–18:20  Opening Remarks (Brian Stevens, Red Hat)
     18:20–18:45  Intro to vLLM (Michael Goin, Red Hat)
     18:45–19:15  Deploying PLaMo2 with vLLM: A Practical Guide (Shinichi Hemmi, Preferred Networks)
     19:15–19:35  LLM Compressor (Michael Goin, Red Hat)
     19:35–19:55  llm-d (Huamin Chen, Red Hat)
     19:55–20:00  Q&A
     20:00–21:00  Networking & Lightning Talks (LT)
  2. Vision: at Red Hat we believe the future of AI is open – source, models, infrastructure, serving.
     Mission: we are on a mission to bring the power of open-source LLMs and vLLM to every enterprise on the planet.
  3. The world changed in November 2022: ChatGPT woke the world up to the power of generative AI.
  4. The power of open: there has been an explosion of capability in open models over the last two years.
     [Timeline, Jan 2023 – Jan 2025: from "no OSS models" through Llama, RedPajama, MPT, Falcon, Llama 2, Mistral, Zephyr, Mixtral, Phi-2, Granite 2, DBRX, Phi-3, Arctic, Llama 3, Qwen2, Gemma2, Nemotron, Granite 3, Qwen2-VL, to DeepSeek-R1.]
  5. The power of open: open models are deployment targets today – and the trend is not slowing down.
     • Llama: 650M downloads in 2024; 85,000 Llama derivative models; 1B, 3B, 8B, 70B, 405B variants; multilingual, multimodal, mobile.
     • R1: first reasoning model on par in quality with OpenAI o1; 1B–70B parameter distilled versions.
     • Headlines: global market pandemonium?
     Models are commoditizing → many options for diverse enterprise needs.
  6. Advantages of open-weight models and serving stack: open models play an important role in the enterprise AI landscape.
     • Cost: self-managed infrastructure; 1B–405B sizes let you match task difficulty to model size.
     • Customization: improve accuracy and costs with task-specific tuning.
     • Control: model lifecycle (no in-place changes to the model); resources (no rate limits / API downtime).
     • Security: complete data privacy (no 3rd-party APIs).
  7. Red Hat: leaders in OSS GenAI inference, with expertise across high-performance inference and SOTA model optimizations.
     • HPC engineering team dedicated to vLLM, with 7 core vLLM committers on staff
     • Work on key subsystems, with a particular emphasis on fast model execution
     • ML engineering team builds vLLM's optimization library, llm-compressor
     • ML research team creates pre-optimized models for deployment with vLLM
     [Chart: core developers of vLLM – Red Hat community contribution, commits by organization.]
  8. Red Hat: leaders in OSS GenAI inference, with expertise across high-performance inference and SOTA model optimizations.
     • Optimized model hub and LLM compression tools covering Llama, Qwen, Mistral, DeepSeek, Gemma, and Phi.
  9. vLLM: the de facto open GenAI inference platform. vLLM has emerged as the Linux of GenAI inference.
     [Diagram: models (Llama, Qwen, DeepSeek, Gemma, Mistral, Molmo, Phi, Nemotron, Granite) served on hardware (GPU, Instinct, Gaudi, TPU, Neuron, Spyre) across edge, private cloud (physical and virtual), and public cloud.]
  10. llm-d: Kubernetes distributed inference at scale.
      Why? A distributed architecture is needed for maximum efficiency and for meeting varying SLOs.
      Core features:
      • Prefill/decode disaggregation
      • KV cache distribution, offloading, and storage hierarchy
      • AI-aware router
      • Operational telemetry for production
      • Kubernetes-based
      • NIXL inference transfer library
      Open source release and announcement at the May Red Hat Summit, together with 10 founding members.
  11. Intro to vLLM. Michael Goin, Principal Software Engineer, Red Hat; vLLM maintainer.
  12. vLLM's goal: build the fastest and easiest-to-use open-source LLM inference & serving engine.
  13. What problem is vLLM solving? Production inference serving.
      ▸ Batch size > 1 and data center hardware
      ・ Not the same workload as on-device inference for a single user
      ▸ How do you:
      ・ Efficiently schedule requests into the next forward pass?
      ・ Manage the KV cache context and runtime memory footprint?
      ・ Make sure GPUs go brrrr?
  14. Why is this a hard problem?
      ▸ An LLM is a function that predicts the next token in a sequence
      ・ P(X_n | X_0, …, X_{n-1})
      ▸ To generate text, we "chain together" passes through the model
      ・ → A single request requires multiple passes through the model
      ・ → A single generation request can last multiple seconds
      ▸ Key challenge: how to handle multiple concurrent requests
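     A minimal sketch of that chained generation loop (the model and tokenizer objects here are hypothetical stand-ins, not vLLM APIs), showing why one request turns into many forward passes:

       # Greedy autoregressive decoding: one forward pass per generated token.
       def generate(model, tokenizer, prompt, max_new_tokens=32, eos_id=2):
           tokens = tokenizer.encode(prompt)
           for _ in range(max_new_tokens):
               logits = model.forward(tokens)          # full pass over the sequence so far
               next_token = int(logits[-1].argmax())   # argmax of P(X_n | X_0, ..., X_{n-1})
               tokens.append(next_token)
               if next_token == eos_id:                # stop on end-of-sequence
                   break
           return tokenizer.decode(tokens)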
  15. Challenge 1: Batching. Static batching 🙅🙅🙅 vs. continuous batching 🙏🙏🙏
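     A toy illustration of the difference (not vLLM's scheduler): with continuous batching, new requests join the running batch at every step and finished requests leave immediately, instead of waiting for the whole static batch to drain. The step_fn callback is a hypothetical stand-in for one forward pass.

       import collections

       def continuous_batching(incoming, step_fn, max_batch=8):
           waiting = collections.deque(incoming)      # requests not yet scheduled
           running, finished = [], []
           while waiting or running:
               while waiting and len(running) < max_batch:
                   running.append(waiting.popleft())  # admit new requests every iteration
               done, running = step_fn(running)       # one forward pass over the whole batch
               finished.extend(done)                  # completed requests return immediately
           return finished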
  16. Challenge 2: KV Caching. Caching the key and value vectors in self-attention saves redundant computation and accelerates decoding.
  17. Challenge 2: KV Caching. Caching the key and value vectors in self-attention saves redundant computation and accelerates decoding - but takes up memory!
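     A rough sizing sketch (illustrative numbers approximating a Llama-3.1-8B-style model, not figures from the slides): the KV cache needs 2 × num_layers × num_kv_heads × head_dim × bytes_per_element per token.

       def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
           # factor of 2: one key vector and one value vector per layer
           return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

       per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
       print(per_token / 1024, "KiB per token")                          # 128.0 KiB in BF16
       print(per_token * 8192 / 1024**3, "GiB for an 8k-token context")  # 1.0 GiB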
  18. vLLM's original innovation: PagedAttention + continuous batching.
  19. vLLM's original innovation: PagedAttention + continuous batching.
      [Diagram: each request's logical KV blocks ("Alan Turing is a computer scientist and mathematician renowned…" for Request A, "Artificial Intelligence is the future of technology" for Request B) are mapped through per-request block tables onto non-contiguous physical KV blocks.]
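     A toy block table in the spirit of PagedAttention (not vLLM's implementation): each request's logical KV blocks map onto arbitrary physical blocks drawn from a shared pool, so memory is allocated in fixed-size pages rather than one contiguous slab per request.

       BLOCK_SIZE = 16  # tokens per KV block

       class BlockTable:
           def __init__(self, free_blocks):
               self.free = free_blocks            # shared pool of physical block ids
               self.logical_to_physical = []      # index = logical block, value = physical block

           def append_token(self, token_index):
               if token_index % BLOCK_SIZE == 0:  # previous block is full, grab a new page
                   self.logical_to_physical.append(self.free.pop())
               return self.logical_to_physical[token_index // BLOCK_SIZE]

       pool = list(range(1024))
       table = BlockTable(pool)
       print([table.append_token(i) for i in range(40)][::16])  # physical blocks for tokens 0, 16, 32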
  20. The 2-year journey of vLLM: vLLM has rapidly evolved from a research project to the open-source default.
      ▸ Pervasive → 100k daily installs in 2025; 50k GitHub stars
      ▸ Explosive growth → 10x usage increase in 2024
      ▸ Vibrant community → 1000+ contributors
  21. Who uses vLLM? vLLM is the de facto OSS inference server, with ~600k weekly installs and ~50k GitHub stars.
      • Model as a Service: AWS, GCP, Azure, NVIDIA, …
      • AI in scaled production: Amazon, Microsoft, LinkedIn, Meta, …
      • Proprietary deployments: Snowflake, Roblox, IBM, …
      • Foundation model labs: Meta, Mistral, Qwen, Cohere, …
      • Fine-tuning frameworks: veRL, TRL, OpenRLHF, …
      • Hardware platforms: NVIDIA, AMD, Google, Intel, ARM, …
  22. Why vLLM for performance? vLLM implements the key optimizations for fast inference:
      • Inference optimizations, to make your models faster
      • Distributed inference, to deploy large models efficiently
  23. Quantization in vLLM: use low-bit precisions (e.g., FP8, INT8, FP4) to store and compute.
      1. Weight quantization: reduced storage and memory footprints (e.g., a 100B model is 200 GB in BFloat16 vs. 100 GB in FP8)
      2. Activation quantization: faster linear layers and compute speedups
      3. KV cache quantization: reduced KV cache footprint and faster attention, crucial for long-context workloads
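     The storage arithmetic from the slide, written out (a small illustrative helper, not part of vLLM):

       BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

       def weight_gigabytes(num_params, dtype):
           return num_params * BYTES_PER_PARAM[dtype] / 1e9

       for dtype in ("bf16", "fp8", "int4"):
           print(dtype, weight_gigabytes(100e9, dtype), "GB")   # 200.0, 100.0, 50.0 GB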
  24. Impact of quantization: quantization enables more tokens for fixed hardware.
  25. Automatic prefix caching: re-use KV cache blocks across requests!
      Request A: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: Hello!"
      Request B: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: How are you?"
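     A hedged sketch of turning this on with the LLM class (automatic prefix caching may already be enabled by default in recent vLLM versions): both prompts share the long system-prompt prefix, so the second request can reuse the first request's cached KV blocks.

       from vllm import LLM

       system = ("A chat between a curious user and an artificial intelligence assistant. "
                 "The assistant gives helpful, detailed, and polite answers to the user's questions. ")
       llm = LLM(model="meta-llama/Meta-Llama-3.1-8B", enable_prefix_caching=True)
       outputs = llm.generate([system + "User: Hello!", system + "User: How are you?"])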
  26. Impact of automatic prefix caching: prefix caching improves time-to-first-token by skipping prefill.
  27. Speculative decoding: accelerate the decode phase with speculation; a variety of methods exist.
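     A conceptual sketch of the draft-and-verify idea behind speculative decoding (greedy acceptance only; the draft_model/target_model methods are hypothetical, and this is not vLLM's implementation):

       def speculative_step(draft_model, target_model, tokens, k=4):
           proposal = draft_model.propose(tokens, k)                # k cheap draft tokens
           # One target-model pass scores all k+1 positions and returns its greedy choice at each.
           verified = target_model.greedy_choices(tokens, proposal)
           accepted = []
           for drafted, wanted in zip(proposal, verified):
               accepted.append(wanted)              # the target's token is always safe to keep
               if drafted != wanted:                # first disagreement: stop accepting drafts
                   return tokens + accepted
           accepted.append(verified[-1])            # all drafts accepted: bonus token for free
           return tokens + accepted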
  28. Impact of speculative decoding: speculative decoding enables better latency in bandwidth-bound regimes.
  29. vLLM combines all optimizations together. [Chart: performance with vs. without optimizations.]
  30. vLLM API (1): the LLM class, a Python interface for offline batched inference.
      from vllm import LLM

      # Example prompts.
      prompts = ["Hello, my name is", "The capital of France is"]

      # Create an LLM with an HF model name.
      llm = LLM(model="meta-llama/Meta-Llama-3.1-8B")

      # Generate texts from the prompts.
      outputs = llm.generate(prompts)  # also llm.chat(messages)
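     The slide's example uses default sampling; generation is typically controlled with SamplingParams (a short extension of the example above):

       from vllm import LLM, SamplingParams

       llm = LLM(model="meta-llama/Meta-Llama-3.1-8B")
       params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
       for output in llm.generate(["Hello, my name is"], params):
           print(output.outputs[0].text)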
  31. vLLM API (2): OpenAI-compatible server, a FastAPI-based server for online serving.
      Server:
      $ vllm serve meta-llama/Meta-Llama-3.1-8B
      Client:
      $ curl http://localhost:8000/v1/completions \
          -H "Content-Type: application/json" \
          -d '{
            "model": "meta-llama/Meta-Llama-3.1-8B",
            "prompt": "San Francisco is a",
            "max_tokens": 7,
            "temperature": 0
          }'
      Available routes include: /tokenize, /detokenize, /v1/models, /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/audio/transcriptions, /pooling, /classify, /score, /rerank, /metrics, and more.
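     Because the routes are OpenAI-compatible, the official openai Python client can talk to the server directly (a sketch; any placeholder api_key works unless the server was started with an API key):

       from openai import OpenAI

       client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
       resp = client.completions.create(
           model="meta-llama/Meta-Llama-3.1-8B",
           prompt="San Francisco is a",
           max_tokens=7,
           temperature=0,
       )
       print(resp.choices[0].text)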
  32. vLLM API (3): the embeddable LLMEngine, a Python library with the full power of vLLM in your framework.
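     A minimal sketch of embedding the engine in your own loop (API names follow vLLM's LLMEngine; exact signatures vary between vLLM versions):

       from vllm import EngineArgs, LLMEngine, SamplingParams

       engine = LLMEngine.from_engine_args(EngineArgs(model="meta-llama/Meta-Llama-3.1-8B"))
       engine.add_request("request-0", "The capital of France is", SamplingParams(max_tokens=8))

       while engine.has_unfinished_requests():
           for output in engine.step():        # one scheduling + forward-pass iteration
               if output.finished:
                   print(output.outputs[0].text)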
  33. Get started today: $ pip install vllm
      https://github.com/vllm-project/vllm ・ https://docs.vllm.ai/ ・ https://twitter.com/vllm_project ・ https://opencollective.com/vllm ・ https://slack.vllm.ai
      Roadmap
      Q2: 🚧 V1 migration; 🚧 large-scale serving; ✅ post-training (RLHF); ✅ performance enhancement on various hardware
      Q3: 🌐 advance the next frontier: NVL72 rack-scale serving; 🏎 specialization: low latency, high throughput, etc.; 🧱 operability: observability, scaling, customization
  34. LLM Compressor. Michael Goin, Principal Software Engineer, Red Hat; vLLM maintainer.
  35. LLM Compressor: compressing your LLMs for optimized deployment with vLLM. https://github.com/vllm-project/llm-compressor
  36. Numerics 101
      ▸ LLMs are a series of matrix multiplications of learned "parameters"
      ▸ Each "parameter" is a value represented by some number of bits
      ▸ The more bits used to represent a parameter, the more detail we can express:
      ・ Dynamic range (min–max value)
      ・ Precision (values close to zero)
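     To make the range/precision trade-off concrete, PyTorch's finfo reports both per dtype (illustration only; the float8 dtypes require a recent PyTorch build):

       import torch

       for dtype in (torch.bfloat16, torch.float16, torch.float8_e4m3fn):
           info = torch.finfo(dtype)
           print(f"{str(dtype):22s} bits={info.bits:2d}  max={info.max:<12.4g}  eps={info.eps:.4g}")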
  37. Quantization 101: quantization aims to reduce the precision of a model's weights (and possibly activations) from high-precision formats (e.g., BF16 training) to low-precision formats (e.g., INT8 / FP8) without dropping model quality. [Diagram: quantization targets.]
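     The core operation in its simplest symmetric, per-tensor form (a toy sketch; real schemes such as those in LLM Compressor use per-channel or per-group scales plus calibration):

       import numpy as np

       def quantize_int8(x):
           scale = np.abs(x).max() / 127.0                       # map the largest magnitude to ±127
           q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
           return q, scale

       def dequantize(q, scale):
           return q.astype(np.float32) * scale

       w = np.random.randn(4, 4).astype(np.float32)
       q, s = quantize_int8(w)
       print("max abs error:", np.abs(w - dequantize(q, s)).max())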
  38. Quantization in vLLM: use low-bit precisions (e.g., FP8, INT8, FP4) to store and compute.
      1. Weight quantization: reduced storage and memory footprints (e.g., a 100B model is 200 GB in BFloat16 vs. 100 GB in FP8)
      2. Activation quantization: faster linear layers and compute speedups
      3. KV cache quantization: reduced KV cache footprint and faster attention, crucial for long-context workloads
  39. Weight quantization (W8A16, W4A16, WNA16)
      ▸ Reduce GPU RAM requirements by squeezing the parameters into lower bit precisions
      ▸ Reduce data movement at the expense of some extra compute to upconvert on each forward pass
      ▸ This is likely what you already know: GPTQ, AWQ, bitsandbytes, GGUF, etc.
  40. Activation quantization (W8A8, W4A8, W4A4)
      ▸ Weight-only quantization is not sufficient for speedups once under load (long prefills or batching)
      ▸ Quantizing both weights and activations means we can finally use low-precision tensor cores!
  41. Why quantize weights and activations? Quantization enables more tokens for fixed hardware.
  42. Accurate compression with fine-grained quantization: not all quantization is the same; quality is important! [Chart: Pass@1 score and standard deviation for quantized models on popular reasoning benchmarks.]
  43. Get started with quantization in vLLM: find pre-optimized models at hf.co/RedHatAI.
      • Pre-optimized model hub (Llama, Qwen, Mistral, DeepSeek, Gemma, Phi) → red.ht/optimized-models
      • LLM Compressor → red.ht/llm-compressor
  44. Quantizing a model with LLM Compressor: picking the right scheme and model.
      Picking a compression scheme:
      ▸ W4A16 (low batch size, memory bound)
      ▸ INT8 or FP8 (high batch size, compute bound)
      ▸ KV cache quantization (large context lengths)
      ▸ 2of4 sparsity (high batch size, smaller model size)
  45. Applying algorithms with LLM Compressor: defining a recipe.
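     The recipe on the slide was shown as a screenshot; a hedged sketch in the style of the llm-compressor examples (import paths and arguments may differ between releases, and the model/dataset names here are illustrative):

       from llmcompressor.modifiers.quantization import GPTQModifier
       from llmcompressor.transformers import oneshot

       # W4A16 GPTQ recipe: quantize every Linear layer except the LM head.
       recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

       oneshot(
           model="meta-llama/Meta-Llama-3.1-8B-Instruct",
           dataset="open_platypus",                 # calibration data
           recipe=recipe,
           max_seq_length=2048,
           num_calibration_samples=512,
           output_dir="Meta-Llama-3.1-8B-Instruct-W4A16",
       )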
  46. Applying algorithms with LLM Compressor: compressing the model.
      Compressing… (1/29): Calibrating: 100%|██████████| 512/512 [00:34<00:00, 14.69it/s]
      2025-02-11T23:14:03.464012-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.q_proj using 512 samples
      2025-02-11T23:14:06.270327-0500 | compress | METRIC - time 2.81s
      2025-02-11T23:14:06.270540-0500 | compress | METRIC - error 1197.95
      2025-02-11T23:14:06.271204-0500 | compress | METRIC - GPU 0 | usage: 11.20% | total memory: 85 GB
      2025-02-11T23:14:06.271554-0500 | compress | METRIC - Compressed module size: 4.77696 MB
      2025-02-11T23:14:06.271732-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.k_proj using 512 samples
      2025-02-11T23:14:06.921636-0500 | compress | METRIC - time 0.65s
      2025-02-11T23:14:06.921845-0500 | compress | METRIC - error 221.82
      2025-02-11T23:14:06.922063-0500 | compress | METRIC - GPU 0 | usage: 11.20% | total memory: 85 GB
      2025-02-11T23:14:06.922356-0500 | compress | METRIC - Compressed module size: 0.79616 MB
      2025-02-11T23:14:06.922503-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.v_proj using 512 samples
      2025-02-11T23:14:07.571902-0500 | compress | METRIC - time 0.65s
      2025-02-11T23:14:07.572131-0500 | compress | METRIC - error 28.66
      2025-02-11T23:14:07.572351-0500 | compress | METRIC - GPU 0 | usage: 11.20% | total memory: 85 GB
      2025-02-11T23:14:07.572643-0500 | compress | METRIC - Compressed module size: 0.79616 MB
      2025-02-11T23:14:07.572780-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.o_proj using 512 samples
      2025-02-11T23:14:08.230037-0500 | compress | METRIC - time 0.66s
      2025-02-11T23:14:08.230248-0500 | compress | METRIC - error 11.95
      2025-02-11T23:14:10.214038-0500 | compress | METRIC - GPU 0 | usage: 11.20% | total memory: 85 GB
      2025-02-11T23:14:10.214890-0500 | compress | METRIC - Compressed module size: 4.773888 MB
      2025-02-11T23:14:10.215273-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.0.mlp.gate_proj using 512 samples
      2025-02-11T23:14:10.959970-0500 | compress | METRIC - time 0.74s
      2025-02-11T23:14:10.960274-0500 | compress | METRIC - error 1085.49
      2025-02-11T23:14:10.960643-0500 | compress | METRIC - GPU 0 | usage: 11.20% | total memory: 85 GB
      2025-02-11T23:14:10.960939-0500 | compress | METRIC - Compressed module size: 27.84768 MB
      2025-02-11T23:14:10.961100-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.0.mlp.up_proj using 512 samples
      2025-02-11T23:14:11.674930-0500 | compress | METRIC - time 0.71s
      2025-02-11T23:14:11.675241-0500 | compress | METRIC - error 708.51
      2025-02-11T23:14:11.675612-0500 | compress | METRIC - GPU 0 | usage: 11.20% | total memory: 85 GB
      2025-02-11T23:14:11.675932-0500 | compress | METRIC - Compressed module size: 27.84768 MB
      2025-02-11T23:14:11.676104-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.0.mlp.down_proj using 512 samples
      2025-02-11T23:14:15.634541-0500 | compress | METRIC - time 3.96s
      2025-02-11T23:14:15.635367-0500 | compress | METRIC - error 22.47
      2025-02-11T23:14:15.635735-0500 | compress | METRIC - GPU 0 | usage: 11.95% | total memory: 85 GB
      2025-02-11T23:14:15.636034-0500 | compress | METRIC - Compressed module size: 27.84768 MB
      (1/29): Propagating: 100%|██████████| 512/512 [00:32<00:00, 15.81it/s]
      (2/29): Calibrating: 100%|██████████| 512/512 [00:03<00:00, 147.09it/s]
      2025-02-11T23:14:51.507693-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.1.self_attn.q_proj using 512 samples
      2025-02-11T23:14:52.171517-0500 | compress | METRIC - time 0.66s
      2025-02-11T23:14:52.171817-0500 | compress | METRIC - error 281.88
      2025-02-11T23:14:52.804425-0500 | compress | METRIC - GPU 0 | usage: 11.95% | total memory: 85 GB
      2025-02-11T23:14:52.805252-0500 | compress | METRIC - Compressed module size: 4.77696 MB
      2025-02-11T23:14:52.805484-0500 | on_sequential_batch_end | INFO - Quantizing model.layers.1.self_attn.k_proj using 512 samples
  47. Deploying to vLLM: native support via compressed tensors (packing + bitmasks).
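     Deploying the result is then an ordinary vLLM model load (a sketch; the directory name matches the hypothetical output_dir above, and vLLM reads the compressed-tensors config from the checkpoint):

       from vllm import LLM

       llm = LLM(model="Meta-Llama-3.1-8B-Instruct-W4A16")
       print(llm.generate(["The capital of France is"])[0].outputs[0].text)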
  48. Composing algorithms with LLM Compressor: SmoothQuant and SparseGPT.
  49. The LLM Compressor ecosystem: LLM Compressor and compressed tensors.
      • Compressed-tensors integration with Transformers
      • Research-backed quantized models, ready to deploy
      • Integrations with SFT frameworks like Axolotl
      • Adopted by foundation model labs!
  50. Demo: compressed model inference acceleration in vLLM.
      ▸ 2x end-to-end latency speedup for the Llama 3.1 70B model
      ▸ 50% fewer GPUs
      ・ Dense (left) running on two A100 80GB GPUs
      ・ Compressed (right) running on one A100 80GB GPU
      ▸ 99.14% accuracy recovery
  51. llm-d: Kubernetes-native distributed inference at scale. Huamin Chen, Ph.D., Distinguished Engineer, Red Hat.
  52. llm-d overview: the problems to solve.
      ▸ Kubernetes-native distributed inference serving stack
      ▸ Optimal performance per dollar across hardware accelerators
      ▸ Advanced optimizations beyond traditional load balancing:
      ・ Prefix caching
      ・ Disaggregated serving
      ・ Intelligent routing
      ▸ Key differentiator: leverages unique LLM inference characteristics for 3x better performance
  53. The problem: LLM inference is different.
      Traditional HTTP                   LLM inference
      Short-lived, uniform requests      Expensive requests with high variance
      Uniform latency requirements       Diverse QoS needs (ms to hours)
      Simple round-robin works           Cache locality matters
      Each replica is equal              Disaggregation opportunities
  54. LLM inference challenges
      ▸ Request variance: input/output token counts create load imbalances
      ▸ Cache locality: multi-turn conversations and RAG benefit from prefix caching
      ▸ Resource optimization: prefill vs. decode have different requirements
      ▸ QoS diversity: code completion (ms) vs. batch processing (hours)
  55. llm-d innovations
      vLLM-optimized inference scheduler:
      ▸ Prefix-cache-aware load balancing
      ▸ KV cache utilization awareness
      ▸ Session affinity for multi-turn conversations
      Disaggregated serving:
      ▸ Prefill: compute-intensive, parallelizable
      ▸ Decode: memory-bandwidth-bound, latency-sensitive
      ▸ Independent scaling and optimization
      Hierarchical prefix caching:
      ▸ Multi-tier: local HBM, host memory, remote storage
      ▸ Cross-instance KV transfer capabilities
      Variant autoscaling (roadmap):
      ▸ Traffic- and hardware-aware
      ▸ Workload-specific QoS optimization
  56. Component deep dive: Deployer
      ▸ Repository: llm-d-deployer
      ▸ Purpose: single Helm chart installation (./llmd-installer.sh)
      ▸ Key features:
      ・ One-command deployment
      ・ Configurable feature toggles
      ・ Built-in metrics (Prometheus/Grafana)
      ・ Development & production configs
  57. Component deep dive: Inference Scheduler
      ▸ Built on Envoy + the Gateway API Inference Extension
      ▸ Pluggable components:
      ・ Filters: model compatibility, resource limits, health
      ・ Scorers: session affinity, prefix cache hits, load balancing
      ・ Scrapers: memory usage, active sessions, cache stats
  58. Inference Scheduler: available scorers
      Scorer            Purpose                                   Benefit
      Session-aware     Prefers pods from the same user session   Conversation continuity
      Prefix-aware      Routes based on prompt prefix matching    Cache hit optimization
      KV cache-aware    Optimizes for KV cache reuse              Memory efficiency
      Load-aware        Avoids overloaded pods                    Even distribution
  59. Component deep dive: KV Cache Manager
      ▸ Global KV cache state management
      ▸ Core components:
      ・ kvcache.Indexer: main orchestrator
      ・ LRU prefix store: tokenized prefix storage
      ・ KVBlock-to-pod index: cache location mapping
      ・ Tokenizers pool: multi-model tokenization
  60. Component deep dive: Model Service
      ▸ Prefill and decode deployments
      ▸ Inference pool and model defined by the Gateway API Inference Extension (GIE)
      ▸ Endpoint picker (EPP) deployment and service
      ▸ Relevant RBAC permissions
  61. Component deep dive: disaggregated prefill/decode
      ▸ Separation of compute phases
      ▸ Benefits:
      ・ Flexibility: per-request optimization
      ・ Resource efficiency: specialized workers
      ・ Scalability: independent scaling
  62. Supporting components
      ▸ Inference simulator:
      ・ OpenAI-compatible API endpoints with configurable response timing
      ・ Development & testing without GPUs
      ▸ Benchmarking suite:
      ・ Comprehensive performance validation
      ・ Regression prevention
      ・ Load testing capabilities
  63. Use cases & applications
      ▸ Multi-turn conversation
      ・ Challenge: redundant computation over conversation history
      ・ Solution: session-aware routing with cached context
      ・ Result: reduced TTFT for subsequent turns
      ▸ RAG (retrieval-augmented generation)
      ・ Challenge: long prompts with retrieved documents
      ・ Solution: prefix-aware routing leverages cached embeddings
      ・ Result: faster responses for knowledge-intensive tasks
      ▸ Agentic computing
      ・ Challenge: iterative patterns with shared context
      ・ Solution: combined session affinity + prefix caching
      ・ Result: reduced latency for reasoning chains
  64. Use cases & applications (cont'd)
      ▸ Code completion
      ・ Challenge: ultra-low latency with shared codebase context
      ・ Solution: KV cache-aware routing to relevant workers
      ・ Result: sub-second interactive coding responses
      ▸ Batch processing
      ・ Challenge: cost optimization for latency-tolerant workloads
      ・ Solution: variant autoscaling optimizes resource utilization
      ・ Result: lower costs while meeting SLAs
  65. Implementation roadmap
      Phase 1: Core infrastructure
      ▸ Inference Gateway integration
      ▸ Basic prefix- and load-aware routing
      ▸ Disaggregated P/D serving prototype
      ▸ KV cache manager foundation
      Phase 2: Advanced optimizations
      ▸ Enhanced KV cache hierarchy
      ▸ Improved disaggregation protocols
      ▸ Cross-accelerator support (TPU, AMD, Intel)
      ▸ Advanced metrics and observability
      Phase 3: Production hardening
      ▸ Variant autoscaling
      ▸ Multi-model support
      ▸ Advanced security features
      ▸ Enterprise integration patterns
  66. Getting started
      Quick installation:
      ▸ git clone https://github.com/llm-d/llm-d-deployer.git
      ▸ cd llm-d-deployer/quickstart
      ▸ ./llmd-installer.sh
      Configurations:
      ▸ Choose the LLM model
      ▸ Choose the router algorithms
      ▸ Choose the P/D replicas
  67. [Screenshot of the running deployment, annotated: Envoy gateway, inference router, vLLM with Qwen3 0.6B, endpoint picker for Qwen3, and model service for Qwen3.]
  68. Benchmarking setup: 6x llm-d (vLLM) nodes running meta-llama/Llama-3.1-70B-Instruct with TP=2 on NVIDIA A100-80GB.
  69. Benchmarking: IGW and vLLM in llm-d collaborated on prefix-cache-aware routing, building on IGW's KV cache-aware load balancing. Evaluated on 2×8×H100 nodes using LMbenchmark with long-input/short-output workloads. Focus: stress KV cache reuse and test routing decision quality.
  70. Community & governance
      ▸ Key contributors: CoreWeave, Google Cloud, IBM Research, NVIDIA, Red Hat
      ▸ Communication: 💬 Slack workspace, 💭 GitHub Discussions, 📧 Google Group
      ▸ Open development: Apache 2.0, upstream-first, component-based ownership
  71. Get involved with the vLLM community
      • Contribute to key vLLM features: comment on and review PRs that interest you, join the discussion on RFCs, and check out the "good first issue" tags.
      • Join the vLLM Developer Slack: ask questions and engage with us via Slack. Join here.
      • Engage with vLLM Office Hours: Red Hat hosts bi-weekly vLLM Office Hours every other Thursday. We share project updates, dig into exciting topics, answer questions, and more. All sessions are recorded; you can engage with the slides and recordings here. We are exploring ways to bring this to your region at appropriate times toward the end of the year.