How the Next Generation of AI Models are Going to Completely Change AI Inference

Diffusion-based language models are on track to disrupt the entire AI inference stack, overturning assumptions that currently drive hundreds of billions in hardware investment. The shift is driven by a fundamental inversion: Autoregressive (AR) models are memory‑bound; diffusion models are compute‑bound. Because modern hardware has excess compute and starved memory bandwidth, diffusion aligns far better with the silicon the industry has already built.

dyb

May 01, 2026


Transcript

  1. How Diffusion Models Will Completely Change AI Inference

    How the Next Generation of AI Models are Going to Completely Change AI Inference
  2. The Autoregressive Memory Trap: GPUs Running at Under 1% Utilization

    THE CORE PROBLEM (< 1%): A $40,000 GPU executes at under 1% of peak compute during single-user LLM inference. The chip spends 99% of its time waiting for data to arrive. The constraint is not multiplication speed; it is memory bandwidth.
    THE MATH (1 : 300): AR models need ~1 FLOP per byte moved, while a Hopper/Blackwell tensor core needs ~300 FLOPs per byte to stay fed. This 300:1 mismatch means the silicon starves while gigabytes of weights and KV cache stream across the chip.
    THE KV CACHE TAX (140 GB): For a 70B-parameter model, 140 GB of weights (70B parameters at FP16) must stream from HBM into SRAM for every single token, and the KV cache grows with sequence length, eating memory capacity. PagedAttention and continuous batching are software patches; once they hit a wall, the problem falls back onto hardware.
    THE INFRASTRUCTURE PATCH ($100B+): Hundreds of billions deployed on a single assumption: that AI generation will always be memory-bottlenecked. HBM4, NVLink 5, photonic interconnects: every upgrade is a data-movement patch, not a compute upgrade. The physical reality of chip manufacturing makes this economically brutal.
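For intuition, here is a minimal back-of-the-envelope sketch of the slide's numbers. The peak-FLOPs and bandwidth figures are assumed round values for a Hopper/Blackwell-class part, not vendor specs, and the ~1 FLOP/byte intensity for AR decoding is the slide's own figure.

```python
# Back-of-the-envelope for the slide's numbers. All hardware figures are
# assumed round values for a Hopper/Blackwell-class GPU, not vendor specs.

PEAK_FLOPS = 1.0e15        # assumed ~1 PFLOP/s dense tensor-core throughput
HBM_BANDWIDTH = 3.3e12     # assumed ~3.3 TB/s HBM bandwidth

# Arithmetic intensity needed to keep the tensor cores fed (FLOPs per byte moved).
needed_intensity = PEAK_FLOPS / HBM_BANDWIDTH      # ~300 FLOPs/byte

# Single-user AR decoding is matrix-vector work: each weight byte is used
# roughly once per token, so delivered intensity is ~1 FLOP/byte (slide figure).
delivered_intensity = 1.0

print(f"needed intensity  : {needed_intensity:.0f} FLOPs/byte")
print(f"utilization       : {delivered_intensity / needed_intensity:.2%}")  # well under 1%

# Latency floor from streaming the weights alone (70B params at FP16 = 140 GB).
weight_bytes = 70e9 * 2
print(f"min time per token: {weight_bytes / HBM_BANDWIDTH * 1e3:.1f} ms")
```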
  3. The Bottleneck Inversion: From Bandwidth-Starved to Compute-Bound

    AUTOREGRESSIVE LLM (Matrix-Vector Ops · Bandwidth-Starved): Sequential token generation, one token at a time. KV cache residency tax. Under 1% GPU utilization.
    DIFFUSION MODEL (Matrix-Matrix Ops · Compute-Bound): Parallel sequence refinement; entire blocks denoised simultaneously. No KV cache. Near-full GPU utilization.
    COMPUTE SCALING (3x / 2yr): Hardware like Blackwell and Rubin compounds FLOPs at 3x every two years while memory bandwidth has stalled. The industry accidentally built perfect silicon for diffusion.
    KV CACHE TAX (0): Diffusion models refine sequences in parallel, so the KV cache tax vanishes and the bottleneck flips from memory bandwidth to raw compute.
    QUANTIZATION FRIENDLY (FP4/FP8): Diffusion is naturally more robust to quantization; low-precision formats like FP4 and FP8 compound beautifully, amplifying the compute advantage.
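A small sketch of why the op-shape change flips the bottleneck: a matrix-vector product reuses each weight byte roughly once, while a matrix-matrix product over a block of tokens reuses it once per token in the block. The dimension and block size below are assumed illustrative values, not properties of any particular model.

```python
# Matrix-vector (AR decode) vs matrix-matrix (diffusion block denoised in
# parallel): arithmetic intensity over the streamed FP16 weights.
# Dimension and block size are assumed illustrative values.

def weight_intensity(d: int, tokens: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weights moved for a (tokens x d) @ (d x d) matmul."""
    flops = 2 * tokens * d * d               # multiply-accumulates
    bytes_moved = d * d * bytes_per_param    # weights stream from HBM once
    return flops / bytes_moved

d = 8192
print(f"AR decode (1 token)        : {weight_intensity(d, 1):6.0f} FLOPs/byte")
print(f"diffusion (256-token block): {weight_intensity(d, 256):6.0f} FLOPs/byte")
# Against the ~300 FLOPs/byte the tensor cores need, the matrix-matrix case is
# near compute-bound while the matrix-vector case is bandwidth-starved.
```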
  4. Inference Becomes Search: 4x Quality for a 1.6x Compute Tax

    1.6x compute cost for 4x search · +40 pts LogicDiff GSM8K gain · 4.2M parameters in the scheduling hack.
    Why Branching Diffusion Is Cheap: Early denoising steps build coarse shapes that multiple candidates can share; candidates only split at the end, when fine details are committed. For k candidates sharing the first s steps of an N-step trajectory, the relative compute cost is m_branch = k - (k - 1) * (s / N). Four candidates sharing 40 of 50 steps cost 1.6x for a 4x-wide quality search. Branching an AR model is financially ruinous; branching diffusion is cheap.
    The Verifier Moat: Value Migration: Diffusion turns generation into a branching search problem judged by a secondary "verifier" model. A DFS search guided by an object-detection verifier catches errors mid-trajectory and forces a different path. If a 4.2M-parameter scheduling hack (LogicDiff) can lift reasoning scores by 40 points without changing the base model's parameters, the hyperscaler gigawatt-compute thesis starts looking fragile. Value migrates from the $2B base model to the companies building elite, proprietary verifier suites.
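The slide's branching-cost formula is simple enough to check directly; a minimal sketch, using the slide's own example of 4 candidates sharing 40 of 50 denoising steps.

```python
# The slide's branching-cost formula: k candidates sharing the first s of N
# denoising steps cost m_branch = k - (k - 1) * s / N relative to one trajectory.

def branch_cost(k: int, s: int, n: int) -> float:
    """Relative compute for k diffusion candidates sharing the first s of n steps."""
    return k - (k - 1) * (s / n)

# The slide's example: 4 candidates sharing 40 of 50 steps.
print(branch_cost(k=4, s=40, n=50))   # 1.6 -> 4x-wide search for 1.6x compute

# Autoregressive branching shares nothing beyond the prompt (s = 0), so it costs k.
print(branch_cost(k=4, s=0, n=50))    # 4.0
```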
  5. Who Builds the Best Map Wins: The Vendor Landscape

    NVIDIA retains its moat via CUDA flexibility for compound diffusion pipelines. AMD's massive HBM capacity hedges against video-diffusion activation spikes. Groq and other ASICs are vulnerable to inter-chip latency on complex search trees. Apple and Qualcomm will strip-mine edge volume when step counts collapse to 1-4.
    Source: "How the Next Generation of AI Models are Going to Completely Change AI Inference" by Devansh, April 2026