
vLLM Community Meetup Tokyo #3 Opening Talk


jpishikawa

March 06, 2026


Transcript

  1. What is vLLM? The high-throughput and memory-efficient open-source inference engine for LLMs
     https://github.com/vllm-project/vllm
     $ uv pip install vllm --torch-backend=auto
     $ vllm serve deepseek-ai/DeepSeek-V3.1 -tp 8
     • Broad Model Support (>100 architectures): DeepSeek, Llama, GPT-OSS, Qwen, Gemma, Granite, Phi, Cohere, Mistral
     • Wide Hardware Support: CUDA, ROCm, Gaudi/XPU, TPU, Neuron, CPU, Metal, MACA, RBLN, Spyre, MLU, Kunlun, Ascend
     • Flexible Device Parallelism: Tensor, Pipeline, Expert, Data, and Context Parallel; Disaggregated Prefill/Decode, Disaggregated Encoder
     • Diverse Project Ecosystem: LLM Compressor, vLLM Semantic Router
     • Most Popular LLM Serving Engine: 70K+ GitHub stars, 800+ PRs/month, 500K+ GPUs deployed 24/7, 2K+ contributors from 50+ major companies, 10K+ members in slack.vllm.ai
  2. What is vLLM? (overview repeated)
     Most Popular LLM Serving Engine: 65K+ GitHub stars, 800+ PRs/month, 500K+ GPUs deployed 24/7, 2K+ contributors, 10K+ members in slack.vllm.ai
     GitHub Octoverse 2025: https://github.blog/news-insights/octoverse/octoverse-a-new-developer-joins-github-every-second-as-ai-leads-typescript-to-1/
  3. vLLM usage sample growth: 80% in Q1/Q2, 30% in Q3
     Warning: this is a (small) biased subset of deployments that did not opt out of usage data reporting.
  4. vLLM API (1): the LLM class, a Python interface for offline batched inference

     from vllm import LLM

     # Example prompts.
     prompts = ["Hello, my name is", "The capital of France is"]

     # Create an LLM with a Hugging Face model name.
     llm = LLM(model="openai/gpt-oss-20b")

     # Generate texts from the prompts.
     outputs = llm.generate(prompts)  # also llm.chat(messages)
  5. vLLM API (2): OpenAI-compatible server, a FastAPI-based server for online serving

     Server:
     $ vllm serve openai/gpt-oss-20b

     Client:
     $ curl http://localhost:8000/v1/responses \
         -H "Content-Type: application/json" \
         -d '{
           "model": "openai/gpt-oss-20b",
           "input": "tell me a 20 words story about a cat"
         }'
  6. Universal Drop-in Compatible Server

     $ vllm serve openai/gpt-oss-20b

     🆕 Multi-Modality Input API
     🆕 Rerank, Pooling and Embedding API
     🆕 Responses API
     🆕 SageMaker API
     🆕 Anthropic API
     🆕 API for RL: tokens-in, tokens-out
     🆕 gRPC API
     🆕 Omni Modality API
  7. vLLM Models: Qwen, Llama, Mistral, GPT-OSS, DeepSeek
     Warning: this is a (small) biased subset that did not opt out of usage data reporting.
  8. vLLM Hardware: AMD GPUs account for a single-digit percentage of instances
     Warning: this is a (small) biased subset that did not opt out of usage data reporting.
  9. vLLM's new KV Offloading Connector
     https://blog.vllm.ai/2026/01/08/kv-offloading-connector.html
     Up to 9x increase in throughput on H100, 2x-22x reduction in TTFT for cache hits
     Try with: --kv_offloading_backend native --kv_offloading_size <GB>
     ▸ Asynchronously offloads KV cache stored in VRAM to CPU RAM
     ▸ Introduced in v0.11.0; the offload size is configurable with --kv_offloading_size since v0.14.0
     ▸ Moves KV cache blocks from VRAM to RAM through the KV Offloading Connector
     ▸ Follows an LRU (Least Recently Used) policy: blocks not used recently are moved asynchronously
     ▸ In v0.12.0, the KV cache block layout, previously split per layer, was consolidated per model, improving IO
     ▸ Optimizations for hybrid KV caches (GPT-OSS, Qwen3.5, etc.) are still to come
     [RFC]: KV Offloading Roadmap #33689 https://github.com/vllm-project/vllm/issues/33689
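The LRU policy described on this slide can be sketched as a toy cache. This is a hypothetical illustration of the eviction policy only, not vLLM's actual implementation; the class name, block IDs, and capacity are invented, and real offloading is asynchronous DMA rather than a dictionary move.

```python
from collections import OrderedDict

class ToyKVOffloader:
    """Toy model of LRU-based KV block offloading from GPU to CPU.

    When GPU capacity is exceeded, the least recently used block is
    moved to the CPU side; re-accessing an offloaded block brings it back.
    """

    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()   # block_id -> data, ordered by recency
        self.cpu = {}              # blocks offloaded to CPU RAM

    def access(self, block_id: int, data=None):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)     # mark as most recently used
        elif block_id in self.cpu:
            data = self.cpu.pop(block_id)      # CPU cache hit: bring back
            self._insert(block_id, data)
        else:
            self._insert(block_id, data)       # brand-new block
        return self.gpu[block_id]

    def _insert(self, block_id, data):
        self.gpu[block_id] = data
        if len(self.gpu) > self.gpu_capacity:
            victim, vdata = self.gpu.popitem(last=False)  # evict LRU block
            self.cpu[victim] = vdata           # offload to CPU RAM

cache = ToyKVOffloader(gpu_capacity=2)
cache.access(1, "kv1")
cache.access(2, "kv2")
cache.access(3, "kv3")                        # block 1 is offloaded to CPU
print(sorted(cache.gpu), sorted(cache.cpu))   # [2, 3] [1]
```

Accessing block 1 again would pull it back to the GPU and offload the then-least-recent block, which is the "2x-22x TTFT reduction for cache hits" scenario in miniature.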
  10. Changing vLLM's Memory Layout
      ▸ In 0.11.0 and earlier, KV blocks were stored separately for each model layer
      ▸ In 0.12.0 this changed: a KV block now spans all layers
      ▸ The larger block size improves IO between CPU and GPU

      Block size per model             0.11.0    0.12.0
      Llama-3.2-1B-Instruct            16 KB     0.5 MB
      Llama-3.1-8B-Instruct            32 KB     2 MB
      Llama-3.1-70B-Instruct (TP=4)    8 KB      1.25 MB
      Qwen/Qwen2.5-3B-Instruct         8 KB      0.44 MB
      Qwen/Qwen3-0.6B                  32 KB     1.75 MB
      Qwen/Qwen2.5-7B-Instruct         16 KB     0.87 MB
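The gap between the per-layer and cross-layer layouts can be reasoned about with simple arithmetic. The model configuration below (16 tokens per block, 8 KV heads, head dimension 128, bf16, 32 layers) is hypothetical and chosen for round numbers; the exact block-size constants vLLM uses may differ from this sketch.

```python
def kv_block_bytes(block_tokens, num_kv_heads, head_dim, dtype_bytes, num_layers=1):
    """Bytes for one KV cache block: a K and a V entry (factor 2) per
    token, per KV head, per head dimension, across num_layers layers."""
    return 2 * block_tokens * num_kv_heads * head_dim * dtype_bytes * num_layers

# Hypothetical model: 16-token blocks, 8 KV heads, head_dim 128, bf16 (2 B), 32 layers.
per_layer = kv_block_bytes(16, 8, 128, 2)                    # old layout: one layer per block
cross_layer = kv_block_bytes(16, 8, 128, 2, num_layers=32)   # new layout: all layers in one block
print(per_layer // 1024, "KB per-layer block")               # 64 KB per-layer block
print(cross_layer // (1024 * 1024), "MB cross-layer block")  # 2 MB cross-layer block
```

The cross-layer block is larger by exactly the layer count, which is why each CPU<->GPU transfer moves far more data per DMA operation.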
  11. vLLM's new KV Offloading Connector
      https://blog.vllm.ai/2026/01/08/kv-offloading-connector.html
      Up to 9x increase in throughput on H100, 2x-22x reduction in TTFT for cache hits
      Try with: --kv_offloading_backend native --kv_offloading_size <GB>
      ▸ Token throughput improves as the CPU hit rate increases
      ▸ DMA throughput improves as blocks get larger
  12. vLLM Project Ecosystem: various projects for LLM inference optimization, including LLM Compressor
  13. Speculative Decoding: accelerate the decoding phase with speculation, using a variety of methods
      A draft model proposes multiple tokens ahead; the target model then verifies them, improving TPOT (time per output token).
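The propose-then-verify loop can be sketched with toy deterministic models over integer tokens. This is a greedy-acceptance sketch only: the toy models, the doubling rule, and the function name are all invented, and real implementations verify all k proposals in a single batched target forward pass (which is where the speedup comes from) and use rejection sampling for sampled decoding.

```python
def speculative_step(draft_model, target_model, prefix, k):
    """One speculative decoding step (toy, greedy version).

    The draft model proposes k tokens autoregressively; the target model
    accepts the longest agreeing prefix and then contributes one token
    itself (a correction on mismatch, or a bonus token if all k match).
    """
    # Draft phase: propose k tokens with the cheap model.
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        proposed.append(t)
        ctx.append(t)

    # Verify phase: accept proposals while the target model agrees.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        expected = target_model(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)   # target's correction ends the step
            break
    else:
        accepted.append(target_model(ctx))  # all k accepted: one bonus token
    return accepted

# Toy models: the target doubles the last token; the draft agrees
# except when the doubled token would exceed 8.
target = lambda ctx: ctx[-1] * 2
draft = lambda ctx: ctx[-1] * 2 if ctx[-1] * 2 <= 8 else 0

print(speculative_step(draft, target, [1], k=3))   # [2, 4, 8, 16]
print(speculative_step(draft, target, [3], k=3))   # [6, 12]
```

When the draft agrees with the target, one step emits k+1 tokens instead of 1, which is exactly the TPOT improvement the slide describes; when it disagrees early, the extra draft work is wasted, which motivates the "when to use it" slide that follows.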
  14. When to use Speculative Decoding? Not a universal solution
      Speculative decoding trades spare compute for memory-bandwidth savings; when there is no spare compute, it becomes a net loss.
      Use it for:
      ▸ Latency-sensitive applications such as RAG, agents, and assistants
      Avoid it for:
      ▸ Throughput-heavy workloads such as batch processing
      ▸ Long-input, short-output workloads (it improves TPOT, not TTFT)
  15. Speculators
      https://github.com/vllm-project/speculators
      ▸ A library for building draft models that can run in vLLM
      ▸ Trains draft models and converts Eagle3 models built with other tools into a vLLM-runnable format
      ▸ Hugging Face-compatible format
  16. Multi-Token Prediction with Qwen3.5
      https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3.5.html
      Try with: --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
      ▸ MTP (Multi-Token Prediction) lets the model emit several future tokens from a single point
      ▸ Verifying those tokens in the next iteration achieves the effect of speculative decoding without a separate draft model
      ▸ To check whether a model supports MTP, see:
        https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models
  17. LLM Compressor: optimize fine-tuned models for inference
      • Supports the latest quantization algorithms, reducing model size while limiting accuracy degradation
        ▪ GPTQ, AWQ, SmoothQuant, SpinQuant, AutoRound
        ▪ INT8, FP8, INT4, NVFP4, MXFP4 (experimental)
        ▪ Weight quantization, activation quantization, KV cache quantization
      • An extended Safetensors format optimized for execution in vLLM
      • Seamless integration with HF AutoModel
  18. Activation Quantization
      Supports quantizing not only model weights (parameters) but also activations (intermediate data).
      Weight quantization only (e.g. W4A16, W8A16):
      ▸ The model is loaded into GPU VRAM at low precision (4 or 8 bit)
      ▸ At compute time, weights are dequantized and the math runs at the original precision (16 bit) on the Tensor Cores
      Weight and activation quantization (e.g. W4A4, W8A8):
      ▸ The model is loaded into GPU VRAM at low precision (4 or 8 bit)
      ▸ Computation runs directly in low precision with no dequantization, improving throughput
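The two paths can be illustrated with toy int8 arithmetic. This sketch uses simple symmetric per-tensor scaling on made-up four-element vectors; real kernels use per-channel or per-group scales and fused GPU matmuls, so treat it as a model of the data flow, not of any actual kernel.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: x is approximated by q * scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.array([0.5, -1.0, 0.25, 0.75], dtype=np.float32)   # toy weights
a = np.array([1.0, 2.0, -0.5, 0.1], dtype=np.float32)     # toy activations

# W8A16 style: dequantize the weights, then compute in float.
qw, sw = quantize_int8(w)
y_w8a16 = (qw.astype(np.float32) * sw) @ a

# W8A8 style: quantize both operands, accumulate in int32,
# and apply the combined scale once at the end (no dequantization step).
qa, sa = quantize_int8(a)
y_w8a8 = (qw.astype(np.int32) @ qa.astype(np.int32)) * (sw * sa)

y_ref = w @ a
print(y_ref, y_w8a16, y_w8a8)   # all three agree to within quantization error
```

Both variants store 8-bit weights, but only the W8A8 path keeps the inner product in integer arithmetic, which is what the low-precision Tensor Core units accelerate.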
  19. Activation Quantization
      Low-precision Tensor Core arithmetic support varies by GPU architecture:
      ▸ Blackwell: FP4 and FP8 supported
      ▸ Hopper: FP4 not supported
      ▸ Ampere: FP4/FP8 not supported
  20. MXFP4 / NVFP4
      https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
      MXFP4:
      ▸ Defined by the Open Compute Project (OCP)
      ▸ One FP8 (E8M0) scaling factor per block of 32 4-bit values
      NVFP4:
      ▸ Defined by NVIDIA
      ▸ One FP8 (E4M3) scaling factor per block of 16 4-bit values, plus a global FP32 scaling factor for the whole tensor
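The storage overhead of the per-block scales is easy to compute from the definitions above (this ignores NVFP4's single per-tensor FP32 scale, which is negligible for large tensors):

```python
def bits_per_value(elem_bits, block_size, scale_bits):
    """Effective storage cost per value for a block-scaled format:
    each block of block_size elements carries one shared scale factor."""
    return elem_bits + scale_bits / block_size

mxfp4 = bits_per_value(4, 32, 8)   # FP4 elements, one E8M0 scale per 32 values
nvfp4 = bits_per_value(4, 16, 8)   # FP4 elements, one E4M3 scale per 16 values
print(mxfp4, nvfp4)                # 4.25 4.5 bits per value
```

NVFP4 pays slightly more metadata per value in exchange for finer-grained scaling (16-element blocks and an E4M3 scale with a mantissa), which is the accuracy argument made in the linked NVIDIA post.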
  21. LLM Compressor v0.9
      https://developers.redhat.com/articles/2026/01/16/llm-compressor-090-attention-quantization-mxfp4-support-and-more
      Attention and KV cache quantization; refactored KV cache quantization
      Arbitrary KV quantization experiments:
      • FP8, INT8, FP4, INT4
      • Per tensor, per channel, per head, etc.
      ▸ Introduces SpinQuant: multiplying by rotation matrices suppresses outliers in the KV cache, improving KV cache quantization accuracy
  22. vLLM Office Hours
      ▸ Bi-weekly sessions covering the latest vLLM updates. If you are interested, check the Red Hat AI account on X: @RedHat_AI
      vLLM Office Hours Playlist: https://www.youtube.com/playlist?list=PLbMP1JcGBmSHxp4-lubU5WYmJ9YgAQcf3
  23. Get involved with the vLLM Community
      Contribute to key vLLM features: comment on and review PRs that interest you, join the discussion on RFCs, and check out the "good first issue" tags.
      Give Us Feedback: we'll email you today's recording as soon as it's ready. Reply and tell us what we are doing right and what we can do better with vLLM office hours, or comment on this slide!
      Join vLLM Developer Slack: ask questions and engage with us via Slack. Join here.
      Join Red Hat's vLLM Mission: Red Hat wants to bring open-source LLMs and vLLM to every enterprise on the planet. We are looking for vLLM engineers to help us accomplish our mission. Apply here.