
vLLM Community Meetup Tokyo #3 Opening Talk


jpishikawa

March 06, 2026


Transcript

  1. What is vLLM? The high-throughput and memory-efficient open-source inference engine for LLMs
     https://github.com/vllm-project/vllm
     $ uv pip install vllm --torch-backend=auto
     $ vllm serve deepseek-ai/DeepSeek-V3.1 -tp 8
     • Broad Model Support (>100 architectures): DeepSeek, Llama, GPT-OSS, Qwen, Gemma, Granite, Phi, Cohere, Mistral
     • Wide Hardware Support: CUDA, ROCm, Gaudi/XPU, TPU, Neuron, CPU, Metal, MACA, RBLN, Spyre, MLU, Kunlun, Ascend
     • Flexible Device Parallelism: Tensor, Pipeline, Expert, Data, and Context Parallel; Disaggregated Prefill/Decode, Disaggregated Encoder
     • Diverse Project Ecosystem: LLM Compressor, vLLM Semantic Router
     • Most Popular LLM Serving Engine: 70K+ GitHub stars, 800+ PRs/month, 500K+ GPUs deployed 24/7, 2K+ contributors from 50+ major companies, 10K+ members in slack.vllm.ai
  2. What is vLLM? (overview repeated)
     Most Popular LLM Serving Engine: 65K+ GitHub stars, 800+ PRs/month, 500K+ GPUs deployed 24/7, 2K+ contributors, 10K+ members in slack.vllm.ai
     GitHub Octoverse 2025: https://github.blog/news-insights/octoverse/octoverse-a-new-developer-joins-github-every-second-as-ai-leads-typescript-to-1/
  3. vLLM usage sample growth: 80% in Q1/Q2, 30% in Q3
     Warning: this is a (small) biased subset of deployments that did not opt out of usage data reporting.
  4. vLLM API (1): the LLM class, a Python interface for offline batched inference

     from vllm import LLM

     # Example prompts.
     prompts = ["Hello, my name is", "The capital of France is"]

     # Create an LLM with a Hugging Face model name.
     llm = LLM(model="openai/gpt-oss-20b")

     # Generate texts from the prompts.
     outputs = llm.generate(prompts)  # also llm.chat(messages)
  5. vLLM API (2): OpenAI-compatible server, a FastAPI-based server for online serving

     Server:
     $ vllm serve openai/gpt-oss-20b

     Client:
     $ curl http://localhost:8000/v1/responses \
         -H "Content-Type: application/json" \
         -d '{
           "model": "openai/gpt-oss-20b",
           "input": "tell me a 20 words story about a cat"
         }'
  6. Universal Drop-in Compatible Server

     $ vllm serve openai/gpt-oss-20b

     🆕 Multi-Modality Input API
     🆕 Rerank, Pooling and Embedding API
     🆕 Responses API
     🆕 SageMaker API
     🆕 Anthropic API
     🆕 API for RL: tokens-in, tokens-out
     🆕 gRPC API
     🆕 Omni Modality API
  7. vLLM Models: Qwen, Llama, Mistral, GPT-OSS, DeepSeek
     Warning: this is a (small) biased subset that did not opt out of usage data reporting.
  8. vLLM Hardware: AMD GPUs account for a single-digit percentage of instances
     Warning: this is a (small) biased subset that did not opt out of usage data reporting.
  9. vLLM's new KV Offloading Connector
     https://blog.vllm.ai/2026/01/08/kv-offloading-connector.html
     Up to 9x increase in throughput on H100, 2x-22x reduction in TTFT for cache hits
     Try with: --kv_offloading_backend native --kv_offloading_size <GB>
     ▸ Asynchronously offloads KV cache stored in VRAM to CPU RAM
     ▸ Introduced in v0.11.0; the offload size is configurable with --kv_offloading_size since v0.14.0
     ▸ Moves KV cache blocks from VRAM to RAM through the KV Offloading Connector
     ▸ Follows an LRU (Least Recently Used) policy: blocks not used recently are moved asynchronously
     ▸ In v0.12.0, the KV cache block layout, previously split per layer, was consolidated per model, improving IO
     ▸ Optimizations for hybrid KV caches (GPT-OSS, Qwen3.5, etc.) are still to come
     [RFC]: KV Offloading Roadmap #33689 https://github.com/vllm-project/vllm/issues/33689
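The LRU policy described on this slide can be sketched as a toy cache. This is a hypothetical illustration of the eviction policy only, not vLLM's actual implementation; the class name, block IDs, and capacity are invented, and real offloading is asynchronous DMA rather than a dictionary move.

```python
from collections import OrderedDict

class ToyKVOffloader:
    """Toy model of LRU-based KV block offloading from GPU to CPU.

    When GPU capacity is exceeded, the least recently used block is
    moved to the CPU side; re-accessing an offloaded block brings it back.
    """

    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()   # block_id -> data, ordered by recency
        self.cpu = {}              # blocks offloaded to CPU RAM

    def access(self, block_id: int, data=None):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)     # mark as most recently used
        elif block_id in self.cpu:
            data = self.cpu.pop(block_id)      # CPU cache hit: bring back
            self._insert(block_id, data)
        else:
            self._insert(block_id, data)       # brand-new block
        return self.gpu[block_id]

    def _insert(self, block_id, data):
        self.gpu[block_id] = data
        if len(self.gpu) > self.gpu_capacity:
            victim, vdata = self.gpu.popitem(last=False)  # evict LRU block
            self.cpu[victim] = vdata           # offload to CPU RAM

cache = ToyKVOffloader(gpu_capacity=2)
cache.access(1, "kv1")
cache.access(2, "kv2")
cache.access(3, "kv3")                        # block 1 is offloaded to CPU
print(sorted(cache.gpu), sorted(cache.cpu))   # [2, 3] [1]
```

Accessing block 1 again would pull it back to the GPU and offload the then-least-recent block, which is the "2x-22x TTFT reduction for cache hits" scenario in miniature.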
  10. Changing vLLM's Memory Layout
      ▸ In 0.11.0 and earlier, KV blocks were stored separately for each model layer
      ▸ In 0.12.0 this changed: a KV block now spans all layers
      ▸ The larger block size improves IO between CPU and GPU

      Block size per model             0.11.0    0.12.0
      Llama-3.2-1B-Instruct            16 KB     0.5 MB
      Llama-3.1-8B-Instruct            32 KB     2 MB
      Llama-3.1-70B-Instruct (TP=4)    8 KB      1.25 MB
      Qwen/Qwen2.5-3B-Instruct         8 KB      0.44 MB
      Qwen/Qwen3-0.6B                  32 KB     1.75 MB
      Qwen/Qwen2.5-7B-Instruct         16 KB     0.87 MB
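The gap between the per-layer and cross-layer layouts can be reasoned about with simple arithmetic. The model configuration below (16 tokens per block, 8 KV heads, head dimension 128, bf16, 32 layers) is hypothetical and chosen for round numbers; the exact block-size constants vLLM uses may differ from this sketch.

```python
def kv_block_bytes(block_tokens, num_kv_heads, head_dim, dtype_bytes, num_layers=1):
    """Bytes for one KV cache block: a K and a V entry (factor 2) per
    token, per KV head, per head dimension, across num_layers layers."""
    return 2 * block_tokens * num_kv_heads * head_dim * dtype_bytes * num_layers

# Hypothetical model: 16-token blocks, 8 KV heads, head_dim 128, bf16 (2 B), 32 layers.
per_layer = kv_block_bytes(16, 8, 128, 2)                    # old layout: one layer per block
cross_layer = kv_block_bytes(16, 8, 128, 2, num_layers=32)   # new layout: all layers in one block
print(per_layer // 1024, "KB per-layer block")               # 64 KB per-layer block
print(cross_layer // (1024 * 1024), "MB cross-layer block")  # 2 MB cross-layer block
```

The cross-layer block is larger by exactly the layer count, which is why each CPU<->GPU transfer moves far more data per DMA operation.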
  11. vLLM's new KV Offloading Connector
      https://blog.vllm.ai/2026/01/08/kv-offloading-connector.html
      Up to 9x increase in throughput on H100, 2x-22x reduction in TTFT for cache hits
      Try with: --kv_offloading_backend native --kv_offloading_size <GB>
      ▸ Token throughput improves as the CPU hit rate increases
      ▸ DMA throughput improves as blocks get larger
  12. vLLM Project Ecosystem: various projects for LLM inference optimization, including LLM Compressor
  13. Speculative Decoding: accelerate the decoding phase with speculation, using a variety of methods
      A draft model proposes multiple tokens ahead; the target model then verifies them, improving TPOT (time per output token).
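The propose-then-verify loop can be sketched with toy deterministic models over integer tokens. This is a greedy-acceptance sketch only: the toy models, the doubling rule, and the function name are all invented, and real implementations verify all k proposals in a single batched target forward pass (which is where the speedup comes from) and use rejection sampling for sampled decoding.

```python
def speculative_step(draft_model, target_model, prefix, k):
    """One speculative decoding step (toy, greedy version).

    The draft model proposes k tokens autoregressively; the target model
    accepts the longest agreeing prefix and then contributes one token
    itself (a correction on mismatch, or a bonus token if all k match).
    """
    # Draft phase: propose k tokens with the cheap model.
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        proposed.append(t)
        ctx.append(t)

    # Verify phase: accept proposals while the target model agrees.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        expected = target_model(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)   # target's correction ends the step
            break
    else:
        accepted.append(target_model(ctx))  # all k accepted: one bonus token
    return accepted

# Toy models: the target doubles the last token; the draft agrees
# except when the doubled token would exceed 8.
target = lambda ctx: ctx[-1] * 2
draft = lambda ctx: ctx[-1] * 2 if ctx[-1] * 2 <= 8 else 0

print(speculative_step(draft, target, [1], k=3))   # [2, 4, 8, 16]
print(speculative_step(draft, target, [3], k=3))   # [6, 12]
```

When the draft agrees with the target, one step emits k+1 tokens instead of 1, which is exactly the TPOT improvement the slide describes; when it disagrees early, the extra draft work is wasted, which motivates the "when to use it" slide that follows.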
  14. When to use Speculative Decoding? Not a universal solution
      Speculative decoding trades spare compute for memory-bandwidth savings; when there is no spare compute, it becomes a net loss.
      Use it for:
      ▸ Latency-sensitive applications such as RAG, agents, and assistants
      Avoid it for:
      ▸ Throughput-heavy workloads such as batch processing
      ▸ Long-input, short-output workloads (it improves TPOT, not TTFT)
  15. Speculators
      https://github.com/vllm-project/speculators
      ▸ A library for building draft models that can run in vLLM
      ▸ Trains draft models and converts Eagle3 models built with other tools into a vLLM-runnable format
      ▸ Hugging Face-compatible format
  16. Multi-Token Prediction with Qwen3.5
      https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3.5.html
      Try with: --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
      ▸ MTP (Multi-Token Prediction) lets the model emit several future tokens from a single point
      ▸ Verifying those tokens in the next iteration achieves the effect of speculative decoding without a separate draft model
      ▸ To check whether a model supports MTP, see:
        https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models
  17. LLM Compressor: optimize fine-tuned models for inference
      • Supports the latest quantization algorithms, reducing model size while limiting accuracy degradation
        ▪ GPTQ, AWQ, SmoothQuant, SpinQuant, AutoRound
        ▪ INT8, FP8, INT4, NVFP4, MXFP4 (experimental)
        ▪ Weight quantization, activation quantization, KV cache quantization
      • An extended Safetensors format optimized for execution in vLLM
      • Seamless integration with HF AutoModel
  18. Activation Quantization
      Supports quantizing not only model weights (parameters) but also activations (intermediate data).
      Weight quantization only (e.g. W4A16, W8A16):
      ▸ The model is loaded into GPU VRAM at low precision (4 or 8 bit)
      ▸ At compute time, weights are dequantized and the math runs at the original precision (16 bit) on the Tensor Cores
      Weight and activation quantization (e.g. W4A4, W8A8):
      ▸ The model is loaded into GPU VRAM at low precision (4 or 8 bit)
      ▸ Computation runs directly in low precision with no dequantization, improving throughput
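The two paths can be illustrated with toy int8 arithmetic. This sketch uses simple symmetric per-tensor scaling on made-up four-element vectors; real kernels use per-channel or per-group scales and fused GPU matmuls, so treat it as a model of the data flow, not of any actual kernel.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: x is approximated by q * scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.array([0.5, -1.0, 0.25, 0.75], dtype=np.float32)   # toy weights
a = np.array([1.0, 2.0, -0.5, 0.1], dtype=np.float32)     # toy activations

# W8A16 style: dequantize the weights, then compute in float.
qw, sw = quantize_int8(w)
y_w8a16 = (qw.astype(np.float32) * sw) @ a

# W8A8 style: quantize both operands, accumulate in int32,
# and apply the combined scale once at the end (no dequantization step).
qa, sa = quantize_int8(a)
y_w8a8 = (qw.astype(np.int32) @ qa.astype(np.int32)) * (sw * sa)

y_ref = w @ a
print(y_ref, y_w8a16, y_w8a8)   # all three agree to within quantization error
```

Both variants store 8-bit weights, but only the W8A8 path keeps the inner product in integer arithmetic, which is what the low-precision Tensor Core units accelerate.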
  19. Activation Quantization
      Low-precision Tensor Core arithmetic support varies by GPU architecture:
      ▸ Blackwell: FP4 and FP8 supported
      ▸ Hopper: FP4 not supported
      ▸ Ampere: FP4/FP8 not supported
  20. MXFP4 / NVFP4
      https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
      MXFP4:
      ▸ Defined by the Open Compute Project (OCP)
      ▸ One FP8 (E8M0) scaling factor per block of 32 4-bit values
      NVFP4:
      ▸ Defined by NVIDIA
      ▸ One FP8 (E4M3) scaling factor per block of 16 4-bit values, plus a global FP32 scaling factor for the whole tensor
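The storage overhead of the per-block scales is easy to compute from the definitions above (this ignores NVFP4's single per-tensor FP32 scale, which is negligible for large tensors):

```python
def bits_per_value(elem_bits, block_size, scale_bits):
    """Effective storage cost per value for a block-scaled format:
    each block of block_size elements carries one shared scale factor."""
    return elem_bits + scale_bits / block_size

mxfp4 = bits_per_value(4, 32, 8)   # FP4 elements, one E8M0 scale per 32 values
nvfp4 = bits_per_value(4, 16, 8)   # FP4 elements, one E4M3 scale per 16 values
print(mxfp4, nvfp4)                # 4.25 4.5 bits per value
```

NVFP4 pays slightly more metadata per value in exchange for finer-grained scaling (16-element blocks and an E4M3 scale with a mantissa), which is the accuracy argument made in the linked NVIDIA post.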
  21. LLM Compressor v0.9
      https://developers.redhat.com/articles/2026/01/16/llm-compressor-090-attention-quantization-mxfp4-support-and-more
      Attention and KV cache quantization; refactored KV cache quantization
      Arbitrary KV quantization experiments:
      • FP8, INT8, FP4, INT4
      • Per tensor, per channel, per head, etc.
      ▸ Introduces SpinQuant: multiplying by rotation matrices suppresses outliers in the KV cache, improving KV cache quantization accuracy
  22. vLLM Office Hours
      ▸ Bi-weekly sessions covering the latest vLLM updates. If you are interested, check the Red Hat AI account on X: @RedHat_AI
      vLLM Office Hours Playlist: https://www.youtube.com/playlist?list=PLbMP1JcGBmSHxp4-lubU5WYmJ9YgAQcf3
  23. Get involved with the vLLM Community
      Contribute to key vLLM features: comment on and review PRs that interest you, join the discussion on RFCs, and check out the "good first issue" tags.
      Give Us Feedback: we'll email you today's recording as soon as it's ready. Reply and tell us what we are doing right and what we can do better with vLLM office hours, or comment on this slide!
      Join vLLM Developer Slack: ask questions and engage with us via Slack. Join here.
      Join Red Hat's vLLM Mission: Red Hat wants to bring open-source LLMs and vLLM to every enterprise on the planet. We are looking for vLLM engineers to help us accomplish our mission. Apply here.