The Making of AI Chips

Yosuke Nakamura (Hardware) and Akira Kawata (Software)
2026-05-15, colloquium (談話会) at Kyoto University

The Making of AI Chips: The Hardware Side

Topics - The Hardware Side
• Introduction / PFN's AI initiatives
• Environment and challenges surrounding AI Chips
• MN-Core design philosophy for AI Chips
• MN-Core technologies for AI Chips
  ◦ Reduction operations and synchronization
  ◦ Bandwidth and register capacity
  ◦ Power control
  ◦ Integration with 3D-stacked DRAM
• Summary

Self Introduction: Yosuke Nakamura
Career
• Apr. 2025 – Present: Preferred Networks (PFN)
• 2008 – 2025: Fujitsu Limited
• Ph.D., Information Science and Technology, The University of Tokyo
Expertise
• Advanced ASIC development for HPC systems, mainframes, and UNIX servers
• End-to-end design: performance, hardware specs, architecture, circuits, verification, quality
Focus
• Bringing research into real-world systems (e.g., Fugaku)

PFN: Vertically Integrating AI Value Chain
PFN develops the full AI technology stack in-house, from AI chips and computing infrastructure to generative AI foundation models and AI solutions. Our vertically integrated approach across these four layers enables us to solve complex, hard-to-address challenges.
• AI products and solutions
• Generative AI foundation models: large language model, model for simulating material energy (PFP)
• Computing infrastructure: GPU cluster, MN-3 (MN-Core™ cluster), cloud-based computing service powered by MN-Core™ 2
• AI chips: MN-Core, MN-Core 2, MN-Core L1000 (launch planned in 2027), next-generation

MN-Core™ Series Roadmap
Anticipating the demand growth of semiconductors for AI computing resources, PFN started developing the first generation of AI processor in the MN-Core™ series in 2016. Currently, PFN is developing high-bandwidth chips specialized for generative AI.
Details: https://projects.preferred.jp/mn-core/en/
Timeline: 2016, 2020, 2023, 2027
• MN-Core (TSMC 12nm): AI training, AI inference, HPC. Development: 2016-. Internal use: 2020-. Compute for external use: 2023-.
• MN-Core 2 (TSMC 7nm): AI training, AI inference, HPC. Test operation: 2023-. External use of servers/compute via PFCP™: 2024-.
• MN-Core L1000: AI inference. Development: 2024-. Planned launch: 2027.
• MN-Core L2000: Massive AI inference, HPC. Development: 2025-. Planned launch: 2027.
• Next-generation: AI training, massive AI inference, HPC.
Figure legend: In development, In discussion

Energy-Efficient Computing Infrastructure
PFN pursues highly energy-efficient computing infrastructure. Powered by PFN's own AI chip MN-Core™ (first generation), MN-3 has topped the Green500 ranking of the world's most energy-efficient supercomputers three times: in June 2020, June 2021, and November 2021.
Details: https://projects.preferred.jp/en/supercomputers/

Challenges and Landscape Surrounding AI Chips

Growth of Computational Power
Conventional Era
• Computational power grew 4× every 3 years
• Driven by Moore's Law
Deep Learning Era
• Compute required for state-of-the-art AI models doubles every 5.5 months (~4.6× per year)
→ AI demands an unprecedented scale of computation
Source: Epoch AI
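The two growth figures on this slide are stated in different units (a doubling time and a per-period factor), so a small conversion helps compare them. The sketch below is only arithmetic on the numbers quoted above; the function names are illustrative. It shows that a 5.5-month doubling time and a ~4.6× yearly factor agree to within rounding, and that 4× every 3 years is roughly 1.6× per year.

```python
import math


def annual_factor(doubling_time_months: float) -> float:
    """Growth factor per year implied by a doubling time in months."""
    return 2.0 ** (12.0 / doubling_time_months)


def doubling_months(factor_per_year: float) -> float:
    """Doubling time in months implied by a yearly growth factor."""
    return 12.0 * math.log(2.0) / math.log(factor_per_year)


# Conventional era on the slide: 4x every 3 years, expressed per year.
MOORE_ANNUAL_FACTOR = 4.0 ** (1.0 / 3.0)

if __name__ == "__main__":
    print(f"5.5-month doubling -> {annual_factor(5.5):.2f}x per year")
    print(f"4.6x per year      -> doubling every {doubling_months(4.6):.2f} months")
    print(f"4x per 3 years     -> {MOORE_ANNUAL_FACTOR:.2f}x per year")
```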

Recent Model: Transformer's Design Philosophy
Milestones: 2012: AlexNet, 2021: Scaling Law, 2023: ChatGPT
Eras: R&D Era, Scaling Era, Production Era
Key Property
• Highly parallelizable architecture (scales well on GPUs)
What was optimized in R&D
• Maximizing training throughput (large-scale training prioritized)
Resulting Trade-off
• Inference efficiency is not fully optimized
• High compute and memory bandwidth required in production

Inference Bottleneck
Each token generation requires reading the model (weights) from memory.
→ Model size ∝ memory access volume (for a 70B model: ~70GB in FP8, ~140GB in FP16)
Bandwidth becomes the bottleneck, directly impacting UX.
Recent generative AI: the number of output tokens is increasing due to reasoning and thinking processes.
Reference: "s1: Simple test-time scaling" (Muennighoff et al.), arXiv:2501.19393, 2025.
[Figure: input and output token sequences with KV cache]
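A back-of-the-envelope sketch of why bandwidth bounds token generation: if every generated token has to read all weights once, the token rate cannot exceed bandwidth divided by model size. The 4 TB/s bandwidth below is a hypothetical value chosen for illustration, not a figure from the talk, and the estimate ignores KV cache traffic and batching.

```python
def weight_bytes(num_params: float, bytes_per_param: float) -> float:
    """Bytes that must be read from memory to touch every weight once."""
    return num_params * bytes_per_param


def max_tokens_per_second(num_params: float,
                          bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    return bandwidth_bytes_per_s / weight_bytes(num_params, bytes_per_param)


# Hypothetical memory bandwidth (assumption, not from the slides): 4 TB/s.
HYPOTHETICAL_BANDWIDTH = 4e12

if __name__ == "__main__":
    for name, width in (("FP8", 1), ("FP16", 2)):
        size_gb = weight_bytes(70e9, width) / 1e9
        rate = max_tokens_per_second(70e9, width, HYPOTHETICAL_BANDWIDTH)
        print(f"70B model in {name}: {size_gb:.0f} GB, at most {rate:.1f} tokens/s")
```

Halving the bytes per parameter doubles the bound, which is one reason low-precision formats and higher memory bandwidth both matter for inference.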

Power Challenges: AI Scaling is Hitting Power Limits
• Flagship AI processors now exceed 1 kW TDP (e.g., GB200 ~2.7 kW, Falcon Shores ~1.5 kW)
• Rack-level power surpasses 100 kVA (e.g., DGX GB200 NVL72 ~120 kVA)
• Moore's Law scaling is slowing → approaching facility limits
→ Performance per watt is now the key driver of scaling
Source: NVIDIA Blackwell Architecture Technical Brief

MN-Core’s design philosophy for AI Chips

Energy-efficient AI Processor MN-Core™ Series
Concept, Design Principles & Targets
• Proprietary Design Philosophy
• Feedback from the AI Workload Research Team
• Simple, High Performance, High Bandwidth, Low Power

The Three Elements That Define MN-Core
• Ultra-wide SIMD
• Vast Local Memory
• Explicit Data Transfer

MN-Core™ Series: Design Philosophy
The MN-Core architecture maximizes the proportion of arithmetic units on the hardware by transferring functions normally allocated to the hardware side to the software side, realizing high performance and energy efficiency.
General-purpose processor
• Software: optimization, code generation
• Hardware: registers, arithmetic units, command scheduler, cache controller, network control circuit, DRAM I/F, on-chip memory, on-chip network
MN-Core series
• Software: optimization, code generation, command scheduler, cache controller, network control
• Hardware: registers, arithmetic units, DRAM I/F, on-chip memory, on-chip network
Details: https://projects.preferred.jp/en/mn-core/

What's Important for a 'High Compute Efficiency x Low Power' AI Chip
AI compute has a different profile from existing general-purpose workloads.
Conventional existing workloads
• Tailor-made workloads, using conditional branches and loops, provide a dynamic procedure.
• Existing general-purpose processors: only a very limited area is dedicated to floating-point units, while the other areas are used to support the "dynamic procedure", such as the OoO scheduler or cache controller.
AI compute
• AI models, as a compute graph that consists of standardized operations, provide a static procedure.
• There is a chance for innovations in architecture based on this static nature.
→ Innovation in HW architecture + global optimization by SW to maximize its value

Ultra-wide SIMD
Matrix MAC operation: A x B + C
One single instruction drives 1024 matrix units simultaneously.
With 16-bit float operations:
• 512 FLOP/cycle per single MAB
• 524,288 FLOP/cycle in total per die (512 x 16 x 8 x 8)
Hierarchically tiled arithmetic blocks
• L2B: Level-2 Broadcasting Block
• L1B: Level-1 Broadcasting Block
• MAB: Matrix Arithmetic Block
• MAU: Matrix Arithmetic Unit
[Figure: MN-Core 2 structure, from the Hot Chips 2024 presentation]
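The per-die figure can be checked from the product written on the slide, 512 x 16 x 8 x 8. The sketch below treats 512 as the per-MAB FLOP/cycle and 16 x 8 x 8 as the tiling that yields the 1024 matrix units; which hierarchy level each factor belongs to is not stated here, so the tuple is left unlabeled. The clock estimate is an inference, not a stated specification: it divides the 384 GFLOPS per MAB quoted in the later comparison slide by 512 FLOP/cycle.

```python
from functools import reduce
from operator import mul

# Per-MAB throughput for 16-bit float, taken from the product on the slide.
FLOP_PER_CYCLE_PER_MAB = 512

# Tiling factors as written on the slide (level assignment not specified).
TILING = (16, 8, 8)


def mabs_per_die(tiling=TILING) -> int:
    """Number of matrix units driven by a single instruction."""
    return reduce(mul, tiling, 1)


def flop_per_cycle_per_die() -> int:
    """Total 16-bit float FLOP per cycle for one die."""
    return FLOP_PER_CYCLE_PER_MAB * mabs_per_die()


def implied_clock_ghz(gflops_per_mab: float = 384.0) -> float:
    """Clock implied by a per-MAB GFLOPS figure (an inference, not a spec)."""
    return gflops_per_mab / FLOP_PER_CYCLE_PER_MAB


if __name__ == "__main__":
    print(mabs_per_die(), flop_per_cycle_per_die(), implied_clock_ghz())
```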

Vast Local Memory
Each PE has its own very large Local Memory and Register File (not a shared memory!).
Per PE:
• Two GRFs (256 64-bit words each), 1R1W
• Two local memories (2048 64-bit words each), 1RW

Overview of Controlling MN-Core
[Figure: Host Memory, DRAM, MN-Core DirectConnect, Matrix Arithmetic Unit (MAU), and Arithmetic Logic Unit (ALU)]

Explicit Data Transfer by Cacheless Architecture
• Because there is no data cache, software can explicitly specify when, from where, and to where data should be transferred.
• Explicit data transfer enables software to perform broadcast or reduction operations at the same time as the transfer.
[Figure: hierarchy of DRAM, L2BM, L1BM, and PEs]

Five Advantages of MN-Core

① Efficiency Improvements with Reduction Operations
Related element: Explicit Data Transfer
Compute units do not sit idle for reduction operations.
• For example, in convolution, a sum is taken along the channel dimension.
• On a GPU, each compute unit (SM, etc.) computes the convolution for each channel, or computes partial sums that add together several channels, and then finally one compute unit takes the total sum through the L2 cache.
• On MN-Core, the total sum can be taken while moving the computation results from one level of the hierarchy to another.
• Time for data transfers can be used to perform the reduction, eliminating separate reduction steps.
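A minimal software model of "reduce while moving up the hierarchy", to make the idea concrete. It is only a conceptual sketch: the fan-in values and function names are illustrative assumptions, and real hardware performs the addition inside the transfer path rather than in a Python loop.

```python
from typing import List, Sequence


def reduce_while_moving_up(values: Sequence[int], fan_in: int) -> List[int]:
    """One hop up the hierarchy: each parent receives the sum of its children.

    The sum is produced as part of the transfer itself, so no separate
    reduction step is scheduled afterwards.
    """
    if len(values) % fan_in != 0:
        raise ValueError("number of children must be a multiple of fan_in")
    return [sum(values[i:i + fan_in]) for i in range(0, len(values), fan_in)]


def hierarchical_sum(pe_partials: Sequence[int], fan_ins: Sequence[int]) -> List[int]:
    """Apply one reducing transfer per hierarchy level, bottom to top."""
    level = list(pe_partials)
    for fan_in in fan_ins:
        level = reduce_while_moving_up(level, fan_in)
    return level


if __name__ == "__main__":
    # 16 hypothetical per-PE partial sums, reduced over two levels of fan-in 4.
    partials = list(range(1, 17))
    print(hierarchical_sum(partials, [4, 4]))
```

After the last hop a single value remains, which is the channel-wise total that a GPU would instead assemble through the L2 cache.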

② Easy Synchronization, with Less Waiting Time
Related element: Ultra-wide SIMD
No software barriers for synchronization are required, as all computation completes within a few clock cycles of each other.
• No explicit synchronization is needed during execution.
• This is because all MABs follow a single instruction stream and execute in lockstep.
• In addition, memory access, communication, and synchronization are statically scheduled at compile time.
• As a result, no dynamic synchronization or waiting is required, significantly reducing idle time.
→ MN-Core achieves high compute efficiency with minimal effort.

③ Efficient Bandwidth Utilization
Related element: Explicit Data Transfer
Bandwidth is primarily used for actual data movement.
• In conventional architectures, part of the bandwidth is used for cache coherency and dynamic data movement.
• To sustain high performance, large interconnect bandwidth is required in modern multi-die systems.
• In MN-Core, data placement is explicitly managed by software, so unnecessary data movement is minimized.
→ Bandwidth can be used more directly for computation
→ Higher bandwidth utilization and hardware efficiency

④ Large Local Capacity per Compute Unit
Related element: Vast Local Memory
"Vast Local Memory" keeps more data close to computation.
Matrix unit | FP16 performance per matrix unit | Register capacity per matrix unit | Description
RTX 5090 Tensor Core | 616 GFLOPS | 63.8KB | Derived from 419 TFLOPS / 680 Tensor Cores. 32 bits x 32 threads x 255 registers / thread
MN-Core 2 MAB | 384 GFLOPS | 144.0KB | 4 PEs x {GRF 4KB + LM 32KB}
Sources:
https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf
https://docs.nvidia.com/cuda/blackwell-tuning-guide/index.html
• MN-Core provides large local memory (register + local memory) per compute unit
• This allows more data to stay on-chip during kernel execution
• Improves data reuse and reduces repeated memory access
→ More efficient kernel execution, especially for loop-based workloads
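The MN-Core 2 capacity figure can be rebuilt from the per-PE numbers given earlier (two GRFs of 256 64-bit words and two local memories of 2048 64-bit words per PE, 4 PEs per MAB). The sketch below does that arithmetic and then compares local capacity per unit of FP16 throughput, using the table values as given; the "KiB per GFLOPS" ratio is an illustrative metric introduced here, not one used in the talk.

```python
WORD_BYTES = 8  # one 64-bit word


def pe_capacity_bytes() -> int:
    """Register file plus local memory of one PE, from the per-PE word counts."""
    grf = 2 * 256 * WORD_BYTES           # two GRFs    -> 4 KiB
    local_memory = 2 * 2048 * WORD_BYTES  # two LMs     -> 32 KiB
    return grf + local_memory


def mab_capacity_kib(pes_per_mab: int = 4) -> float:
    """Local capacity of one MAB in KiB."""
    return pes_per_mab * pe_capacity_bytes() / 1024


def kib_per_gflops(capacity_kib: float, gflops: float) -> float:
    """Local capacity available per GFLOPS of FP16 throughput."""
    return capacity_kib / gflops


if __name__ == "__main__":
    mn_core = kib_per_gflops(mab_capacity_kib(), 384.0)
    rtx = kib_per_gflops(63.8, 419e3 / 680)  # table values for the Tensor Core
    print(f"MN-Core 2 MAB: {mab_capacity_kib():.1f} KiB, {mn_core:.3f} KiB/GFLOPS")
    print(f"RTX 5090 Tensor Core: {rtx:.3f} KiB/GFLOPS")
```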

⑤ Predictable Power Consumption for Stable Operation
Related elements: Ultra-wide SIMD, Explicit Data Transfer
The deterministic nature of the wide SIMD units and explicit data transfer allows the software to predict and control the power profile.
• In the recent semiconductor industry, power consumption keeps growing, reaching 1.x kW, while supply voltages are kept around 0.7 V.
  ◦ Thus, current consumption is growing to around 2 kA: i(t) = P(t) / V
• The difficulty is that the current varies greatly with the compute load. → The supply voltage can vary greatly.
  ◦ Parasitic inductances matter: v(t) = V0 - R*i(t) - L*di(t)/dt
[Figure: supply V0 connected to MN-Core (0.7 V) through series resistance R and inductance L, with current i(t) and voltage v(t)]

⑤ Predictable Power Consumption for Stable Operation (continued)
Related elements: Ultra-wide SIMD, Explicit Data Transfer
• A sudden current change leads to a voltage spike, which could cause a malfunction if the voltage goes below "Vmin". So a gradual change in compute load is key to stable operation.
• Although MN-Core could cause sudden changes due to its wide SIMD operation, its deterministic nature makes the changes predictable. The software can use control knobs such as clock frequency and nop insertion to suppress sudden changes.
• In other words, MN-Core doesn't require unnecessary guard bands and can exploit higher performance while maintaining stable operation.
  ◦ Of course, users don't have to care about this control.
[Figure: current waveform i(t) and voltage waveform v(t); a huge current from a sudden compute load pulls v(t) from V0 to below Vmin, causing a malfunction]
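To illustrate why a gradual change in load matters, the sketch below evaluates the usual series R-L droop model, v(t) = V0 - R*i(t) - L*di(t)/dt, for a linear current ramp. Every component value (V0, R, L, Vmin, the current levels, and the ramp times) is a hypothetical number picked for illustration and does not describe MN-Core or its power delivery network.

```python
def load_current(power_w: float, voltage_v: float) -> float:
    """i = P / V, the current drawn at a given power and supply voltage."""
    return power_w / voltage_v


def min_voltage_during_ramp(v0: float, r_ohm: float, l_henry: float,
                            i_start: float, i_end: float, ramp_s: float) -> float:
    """Lowest v(t) = V0 - R*i(t) - L*di/dt while the current ramps linearly.

    di/dt is constant during the ramp, so the minimum occurs where the
    current is largest.
    """
    di_dt = (i_end - i_start) / ramp_s
    return v0 - r_ohm * max(i_start, i_end) - l_henry * di_dt


# Hypothetical values (assumptions for illustration only).
V0, VMIN = 0.75, 0.65          # volts
R, L = 20e-6, 10e-12           # ohms, henries
I_IDLE, I_BUSY = 500.0, 2000.0  # amperes

slow = min_voltage_during_ramp(V0, R, L, I_IDLE, I_BUSY, ramp_s=1e-6)
fast = min_voltage_during_ramp(V0, R, L, I_IDLE, I_BUSY, ramp_s=1e-7)

if __name__ == "__main__":
    print(f"current at 1.4 kW, 0.7 V: {load_current(1400.0, 0.7):.0f} A")
    print(f"1 us ramp:   v_min = {slow:.3f} V, ok = {slow >= VMIN}")
    print(f"100 ns ramp: v_min = {fast:.3f} V, ok = {fast >= VMIN}")
```

With these made-up values the same load step stays above Vmin when spread over a longer ramp and falls below it when applied ten times faster, which is the effect that knobs such as clock frequency and nop insertion are meant to control.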

MN-Core L1000: LLM Inference Accelerator (under development)
Vertical integration of 3D DRAM and logic wafer, giving outstandingly high memory bandwidth and large capacity.
• Practical capacity for LLMs: high-density 3D DRAM achieves capacity equivalent to HBM.
• Exceeds HBM bandwidth: ultra-high-speed vertical connection.
• Reduction of inter-wafer distance achieves communication speeds equivalent to SRAM.
[Figure: 3D DRAM stacked on the MN-Core logic wafer]

High Compatibility with 3D DRAM
• GPU uses shared memory (L2), with a coherence protocol across the whole chip.
• L1000 uses distributed memory, with local access to local memory.
• Distributed memory architecture at the PE level
  ◦ Pairs of PE and 3D DRAM in a tile pattern
• Low power consumption and deterministic power control
[Figure: GPU with processing units, L1 caches, a shared L2 cache, and memory, next to MN-Core L1000 with processing units paired with 3D stacked DRAM]

Memory Technology: MN-Core L Series Architecture
• Logic + Mem (HBM): NVIDIA, SambaNova, Google, AWS, Intel, AMD, etc. 👍 Speed, 👍 Capacity
• Logic + Mem (SRAM): Groq, Cerebras. 👍👍👍 Speed, 😐 Capacity
• Logic + Memory (3D stacked DRAM): MN-Core L series. 👍👍👍 Speed, 👍 Capacity
Fully Distributed Memory Architecture
• 3D stacked DRAM with processing units in an in-chip network
• ① Short data movement
• ② Data stays near memory

Our plan: the MN-Core L1000 series will speed up LLM inference by 10x.

Summary of the Hardware Side
• LLMs are reshaping computing requirements.
  ◦ Models are growing rapidly: larger, deeper, and more complex.
• PFN provides end-to-end support.
  ◦ From software to hardware.
  ◦ Key hardware bottlenecks: compute, memory bandwidth, and power.
• PFN is developing MN-Core, a dedicated AI chip.
• Key features address these challenges:
  ◦ Massive SIMD for parallel computation.
  ◦ Large local memory for efficient bandwidth use.
  ◦ Explicit data transfer to reduce synchronization.
  ◦ Simplified SIMD for better power efficiency.
  ◦ 3D DRAM for higher bandwidth and performance.

The Making of AI Chips: The Software Side

Self Introduction: Akira Kawata
Short history
• Mar. 2020: Received M.S. in Informatics, Kyoto University
• Apr. 2020: Joined Preferred Networks (PFN)
Current work
• Compiler Engineer, Preferred Networks
• Worked on the MN-Core software stack
• Focused on the runtime system, including the Python interface, C++ middleware, and device drivers
Interests
• Binary hacks
• https://akawashiro.com/

Today's Topics
• How we use MN-Core with the software stack today
• How the MN-Core software stack works
• How we have developed the MN-Core software stack

How We Use MN-Core with the Software Stack Today

We Are Using MN-Core
MN-Core is already used in production.
• AI-based material search (Matlantis)
• Image recognition (Kachaka)
• 3D model generation (PFN 3D-Scan)
• LLM inference (example)

You Can Use MN-Core Now
• You can make your own LLM using MN-Core
• https://playground.mn-core.com/slm-customize

You Can Write Your Own Program on MN-Core
• The SDK is available
• https://dev.mn-core.com/sdk/0.4/MLSDK/docs/en/
[Figure: documentation (EN/JP) and an image generation example]

How the MN-Core Software Stack Works

Overview of MN-Core Software Stack
[Figure: MN-Core Runtime, MNCL Host Runtime, MNCL Device Translator, MNACC, Graph Extraction, Graph Lowering, Layout and Location Planning, and Instruction Emission; one part of the stack is highlighted with the note "I will focus on this part today"]

Example

    # torch provides the sample and input tensors used below
    import torch
    from mlsdk import compile

    def double(input):
        x = input["x"]
        return {"out": x * 2}

    sample = {"x": torch.zeros(3, 4)}
    compiled_double = compile(
        double,
        sample,
    )
    result = compiled_double(
        {"x": torch.ones(3, 4)})

The example has three steps: define a function to compile, compile the function, and run the compiled function.

Compiler Overview
Python source → ONNX → High-level IR → Low-level IR → Machine instructions → Binary file (metadata + machine instructions)

Python source:

    def double(input):
        x = input["x"]
        return {"out": x * 2}

Machine instructions:

    lpassa $lm0v $ln0v
    lpassa $lm8v $ln8v
    lpassa $lm16v $ln16v
    lpassa $lm24v $ln24v
    …….

Python to ONNX
Convert a Python function into an ONNX graph by extracting the computation graph.
• ONNX is a graph format widely used in machine learning.
• An edge in ONNX corresponds to an array.
• A node in ONNX corresponds to an operation on arrays in Python.
[Figure: the Python source of double and its ONNX graph]
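One common way to extract a computation graph from Python is tracing: run the function once on placeholder objects that record operations instead of computing them. The toy tracer below illustrates that idea only. `Traced`, `Graph`, and `extract_graph` are hypothetical names, the output is a plain list rather than a real ONNX model, and this is not a description of how the MN-Core SDK extracts graphs. The node type `Mul` is the name ONNX uses for element-wise multiplication.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Node = Tuple[str, Tuple[str, ...], str]  # (op type, input edges, output edge)


@dataclass
class Graph:
    nodes: List[Node] = field(default_factory=list)
    counter: int = 0

    def new_edge(self) -> str:
        """Create a fresh edge name; each edge stands for one array."""
        self.counter += 1
        return f"t{self.counter}"


class Traced:
    """Stand-in for an array: records operations instead of computing them."""

    def __init__(self, graph: Graph, name: str) -> None:
        self.graph, self.name = graph, name

    def __mul__(self, other) -> "Traced":
        rhs = other.name if isinstance(other, Traced) else f"const({other})"
        out = self.graph.new_edge()
        self.graph.nodes.append(("Mul", (self.name, rhs), out))
        return Traced(self.graph, out)


def double(input):
    x = input["x"]
    return {"out": x * 2}


def extract_graph(fn, input_names: List[str]) -> Tuple[Graph, Dict[str, str]]:
    """Run fn once on traced placeholders and return the recorded graph."""
    graph = Graph()
    outputs = fn({name: Traced(graph, name) for name in input_names})
    return graph, {key: value.name for key, value in outputs.items()}


graph, outputs = extract_graph(double, ["x"])

if __name__ == "__main__":
    print(graph.nodes)  # one Mul node whose inputs are edge "x" and the constant 2
    print(outputs)      # maps the output key "out" to the edge the node produced
```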

ONNX to High-Level IR Graph
Decide the location and layout of each ONNX edge.
• A high-level IR graph is an extended version of ONNX.
• Location describes where each array is placed.
  ◦ MN-Core has fast local memory and large but slower DRAM.
  ◦ Because MN-Core has no caches, we must schedule memory transfers explicitly.
• Layout describes how arrays are mapped onto MN-Core's tree structure.
  ◦ See our blog [1] for details of layout.
[1]: https://tech.preferred.jp/ja/blog/mn-core-tensor-layout/

High-Level IR Graph to Low-Level IR Graph
Convert each node in the high-level IR graph into a low-level IR graph.
• A low-level IR graph is an instruction-level intermediate representation.
• Each node in the high-level IR graph (left in the figure) is translated into a low-level IR graph (right in the figure).
• So this process generates multiple low-level IR graphs from one high-level IR graph.

From Low-Level IR Graph to Machine Instructions
Generate a sequence of machine instructions from a low-level IR graph.
• The compiler tries to emit a shorter instruction sequence from the given low-level IR graph.
• Because MN-Core has no jump or branch instructions, a shorter instruction sequence directly leads to better performance.

Machine instructions:

    lpassa $lm0v $ln0v
    lpassa $lm8v $ln8v
    lpassa $lm16v $ln16v
    lpassa $lm24v $ln24v
    …….
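Without jump or branch instructions, anything that would be a loop has to be written out as straight-line code, so the instruction count grows with the trip count. The sketch below shows that unrolling by generating the instruction text listed on the slide. `emit_unrolled` is a hypothetical helper; the meaning of `lpassa` and of its operands is not explained in this talk, so the code treats them as opaque strings.

```python
from typing import List


def emit_unrolled(opcode: str, first_prefix: str, second_prefix: str,
                  count: int, stride: int) -> List[str]:
    """Emit `count` copies of one instruction, advancing both operands by `stride`.

    There is no loop instruction to fall back on, so every iteration becomes
    its own line in the instruction stream.
    """
    return [
        f"{opcode} ${first_prefix}{i * stride}v ${second_prefix}{i * stride}v"
        for i in range(count)
    ]


program = emit_unrolled("lpassa", "lm", "ln", count=4, stride=8)

if __name__ == "__main__":
    print("\n".join(program))
```

Because execution time is proportional to the length of this list, any rewrite that covers the same data with fewer instructions shortens the run directly.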

Binary File Packing
Instruction sequences are packed into one binary file with some meta information.
• After all low-level IR graphs are translated, the resulting instruction sequences are concatenated and packed into one binary file.
• A binary file contains:
  ◦ Instruction sequences
  ◦ Input/output information
  ◦ Relocation information
[Figure: several machine-instruction sequences (lpassa $lm0v $ln0v, ...) are combined with metadata into one binary file]
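A minimal sketch of the packing step, assuming a made-up container layout: a fixed header with two lengths, then JSON metadata, then the concatenated instructions stored as text. The actual MN-Core binary format is not described in the talk, so the layout, field names, and functions here are all hypothetical; real binaries would hold encoded instructions rather than text.

```python
import json
import struct
from typing import Dict, List, Tuple

HEADER = "<II"  # little-endian: metadata length, instruction-body length


def pack_binary(instruction_sequences: List[List[str]],
                io_info: Dict, relocations: List[Dict]) -> bytes:
    """Concatenate all instruction sequences and prepend metadata."""
    instructions = [ins for seq in instruction_sequences for ins in seq]
    metadata = json.dumps({"io": io_info, "relocations": relocations}).encode("utf-8")
    body = "\n".join(instructions).encode("utf-8")
    return struct.pack(HEADER, len(metadata), len(body)) + metadata + body


def unpack_binary(blob: bytes) -> Tuple[Dict, List[str]]:
    """Inverse of pack_binary: recover the metadata and the instruction list."""
    meta_len, body_len = struct.unpack_from(HEADER, blob, 0)
    start = struct.calcsize(HEADER)
    metadata = json.loads(blob[start:start + meta_len].decode("utf-8"))
    body = blob[start + meta_len:start + meta_len + body_len].decode("utf-8")
    instructions = body.split("\n") if body else []
    return metadata, instructions


if __name__ == "__main__":
    blob = pack_binary(
        [["lpassa $lm0v $ln0v"], ["lpassa $lm8v $ln8v"]],
        io_info={"inputs": ["x"], "outputs": ["out"]},
        relocations=[{"instruction": 0, "symbol": "x"}],
    )
    print(unpack_binary(blob))
```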

What the Runtime Does
The runtime controls MN-Core from the host computer.
• Execution of compiled functions
• Data transfer between the host computer and MN-Core
• MN-Core memory management
[Figure: the runtime on the host computer controls MN-Core]

Program Execution
Run a compiled binary on MN-Core.
• Fix up instructions in a binary file
  ◦ so that the addresses match the given inputs and outputs
  ◦ See my earlier talk [1] on relocation in the MN-Core compiler/runtime.
• Send relocated instructions to MN-Core.

Relocated instructions (addresses rewritten on the fly):

    lpassa $lm0v $ln256v
    lpassa $lm8v $ln264v
    lpassa $lm16v $ln272v
    lpassa $lm24v $ln280v
    …….

[1]: https://speakerdeck.com/pfn/20241213_pfn_camphor
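The before/after listings in this talk differ only in the second operand ($ln0v becomes $ln256v, $ln8v becomes $ln264v, and so on), which is consistent with adding a base address of 256. The sketch below reproduces that rewrite on instruction text. It is a textual illustration only: the real runtime presumably patches encoded instructions using the relocation information in the binary, and both `relocate` and the choice of the `$ln` operand as the relocated field are assumptions drawn from those listings.

```python
import re
from typing import List

# Matches operands such as $ln0v or $ln24v and captures the numeric address.
_LN_OPERAND = re.compile(r"\$ln(\d+)v")


def relocate(instructions: List[str], base: int) -> List[str]:
    """Shift every $ln<addr>v operand by `base`, leaving other operands alone."""

    def shift(match: "re.Match[str]") -> str:
        return f"$ln{int(match.group(1)) + base}v"

    return [_LN_OPERAND.sub(shift, ins) for ins in instructions]


compiled = [
    "lpassa $lm0v $ln0v",
    "lpassa $lm8v $ln8v",
    "lpassa $lm16v $ln16v",
    "lpassa $lm24v $ln24v",
]
relocated = relocate(compiled, base=256)

if __name__ == "__main__":
    print("\n".join(relocated))
```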

Data Transfer Between Host and Device
Transfer arrays between the host computer and MN-Core.
• Data conversion between a Python array and the MN-Core data format
  ◦ Data reordering
  ◦ Floating-point conversion
    ▪ For example: IEEE 754 32-bit float ⇔ MN-Core 2 block float
• Transfer arrays between the host computer and MN-Core
  ◦ Split the data into chunks that fit the DMA unit
  ◦ Control Direct Memory Access (DMA)
[Figure: a Python array ([0, 1, 2, ...]) is converted to the MN-Core data format (01010101...) and sent to MN-Core]
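The chunking step can be sketched independently of the device: serialize the array, then cut the byte stream into pieces no larger than one DMA transfer. In the sketch below, plain little-endian float32 stands in for the MN-Core data format (the real conversion also reorders data and may convert to block float), and the 20-byte chunk size is an arbitrary illustrative value, not a hardware parameter.

```python
import struct
from typing import List, Sequence


def to_device_bytes(values: Sequence[float]) -> bytes:
    """Serialize values as little-endian float32 (stand-in for the device format)."""
    return struct.pack(f"<{len(values)}f", *values)


def split_into_dma_chunks(payload: bytes, chunk_bytes: int) -> List[bytes]:
    """Cut the payload into consecutive chunks of at most `chunk_bytes` bytes."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    return [payload[offset:offset + chunk_bytes]
            for offset in range(0, len(payload), chunk_bytes)]


payload = to_device_bytes([float(i) for i in range(12)])  # 12 floats = 48 bytes
chunks = split_into_dma_chunks(payload, chunk_bytes=20)

if __name__ == "__main__":
    print([len(chunk) for chunk in chunks])
```

The last chunk is simply shorter when the payload is not a multiple of the chunk size; a real driver might instead pad it to an alignment boundary.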

Device Memory Management
Manage MN-Core DRAM memory.
• An array object in Python corresponds to a region in device memory.
  ◦ For example, result in the code below corresponds to a region in device memory.
• That device memory is freed automatically when the corresponding object is destroyed.

    compiled_double = compile(
        double,
        sample,
    )
    result = compiled_double(
        {"x": torch.ones(3, 4)})
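One way to tie a device region to the lifetime of a Python object is `weakref.finalize`, which runs a callback when the object is garbage collected. The sketch below uses it with a toy bump allocator. `DeviceAllocator` and `DeviceArray` are hypothetical classes written for this illustration, and the talk does not say which mechanism the MN-Core runtime actually uses.

```python
import gc
import weakref
from typing import Dict


class DeviceAllocator:
    """Toy allocator that only tracks which regions are live."""

    def __init__(self) -> None:
        self.live: Dict[int, int] = {}  # start address -> size in bytes
        self._next = 0

    def allocate(self, size: int) -> int:
        address = self._next
        self._next += size
        self.live[address] = size
        return address

    def free(self, address: int) -> None:
        self.live.pop(address, None)


class DeviceArray:
    """Python-side handle for one region of device memory."""

    def __init__(self, allocator: DeviceAllocator, size: int) -> None:
        self.size = size
        self.address = allocator.allocate(size)
        # The callback captures the address, not self, so it does not keep
        # the handle alive.
        self._finalizer = weakref.finalize(self, allocator.free, self.address)


allocator = DeviceAllocator()
result = DeviceArray(allocator, 48)
live_while_referenced = dict(allocator.live)

del result    # drop the last reference to the handle
gc.collect()  # make collection explicit for interpreters without refcounting
live_after_delete = dict(allocator.live)

if __name__ == "__main__":
    print(live_while_referenced, live_after_delete)
```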

How We Have Developed the MN-Core Software Stack

How We Have Developed the MN-Core Software Stack
• We have continuously improved our software stack by using it internally.
• We have ported many applications to MN-Core by ourselves.
  ◦ ResNet50
  ◦ Network Architecture Search
  ◦ LLM training, etc.
• Application porting consists of the following steps:
  ◦ Rewrite Python code
  ◦ Emit a valid high-level IR graph
  ◦ Implement missing operators
  ◦ Debug
  ◦ Performance tuning
  ◦ Deliver the result

Rewriting Python to Fit Our Software Stack
• Machine learning applications use many libraries.
• We need to peel them back and re-implement the core part.
• Sometimes, we also need to change the algorithm because of hardware characteristics.

Before:

    def double(input):
        x = input["x"]
        return {"out": x * 2}

    result = double({"x": torch.ones(3, 4)})

After:

    from mlsdk import compile

    def double(input):
        x = input["x"]
        return {"out": x * 2}

    sample = {"x": torch.zeros(3, 4)}
    compiled_double = compile(
        double,
        sample,
    )
    result = compiled_double(
        {"x": torch.ones(3, 4)})

Emit a Valid High-Level IR Graph
Sometimes, the compiler fails to find a valid schedule and cannot emit a high-level IR graph. In such cases, we:
• Fix the compiler itself.
• Rewrite the Python script to make it more compiler-friendly.
  ◦ For example, our compiler handles arrays whose shapes are powers of two more easily.
• Add compiler hints.

    # Example of compiler hints
    {
      "trigger": "Mul",
      "layout": "((3:4), (4:1))"
    }

Fixing Operator Implementations
• After a valid high-level IR graph is emitted, we need to compile all operators in the high-level IR graph.
• We make a big spreadsheet, divide the failing operators among teams, and implement them.
[Figure: spreadsheet of operators, zoomed in]

Debugging
• In many cases, we need to debug after the compiler successfully emits a binary file.
• Debugging tends to be hard because
  ◦ We build the entire software and hardware stack ourselves.
  ◦ Machine learning applications often appear to work even when the stack contains bugs, which makes bugs harder to find.
  ◦ Machine learning applications often take a very long time to run. Sometimes, it takes a whole day just to reproduce a single bug.
• We debug by checking the most likely causes one by one, inspecting logs, machine instructions, and binary files very carefully.

Debug Example: Stuck in Software Initialization
• Problem:
  ◦ Software initialization does not complete.
• Possible causes:
  ◦ Software deadlock
  ◦ Hardware FIFO overflow
  ◦ Failure during hardware initialization
    ▪ We checked hardware registers to find where execution gets stuck.
[Figure: chain of HW Modules A to E; one module is annotated "Oops, this module doesn't receive any data!"]

Debug Example: When the Loss Doesn't Change
• Problem:
  ◦ Training loss doesn't change.
• Possible causes:
  ◦ Bug in an operator implementation
  ◦ Memory address mismatch
  ◦ Missing metadata in the binary file
[Table: the loss stays at 9.13 for iterations 1 through 10]

Debug Example: When the Trained Model Performs Poorly
• Problem:
  ◦ Loss curve doesn't match well between MN-Core and GPU.
  ◦ A model trained on MN-Core performs worse than the same model trained on a GPU.
• Possible causes:
  ◦ Bug in the training script
  ◦ Insufficient floating-point precision

Building a Debugger
We sometimes build debuggers to find a bug.
• Some complex bugs cannot be resolved only by checking logs, binary files, and source code.
• We built a memory-overwrite detector.
  ◦ Dump the allocator's memory map
  ◦ Dump the hardware memory map
    ▪ Fill the memory with a dummy value initially and read the whole area after running the application.
  ◦ Compare them to find mismatches
• We got this idea from the mechanism of AddressSanitizer. See [1].
[Figure: the allocator's memory map (used regions) next to the dumped memory map; a region outside the used ones is marked "Someone overwrote this by mistake!"]
[1]: Binary Hacks Rebooted - O'Reilly Japan #47
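The detector described above can be modeled in a few lines: pre-fill memory with a dummy byte, let the application run, then flag every byte that changed outside the regions the allocator believes are in use. The sketch below is a host-side simulation with a `bytearray` standing in for device memory; `find_overwrites`, the dummy value 0xAB, and the region list are illustrative assumptions rather than the actual tool.

```python
from typing import List, Sequence, Tuple

DUMMY = 0xAB  # fill pattern written before the application runs (illustrative)


def find_overwrites(memory: bytes,
                    used_regions: Sequence[Tuple[int, int]],
                    dummy: int = DUMMY) -> List[int]:
    """Return addresses outside all used regions whose content is not the dummy.

    used_regions is the allocator's view: (start address, size) pairs.
    """
    allowed = bytearray(len(memory))
    for start, size in used_regions:
        for address in range(start, start + size):
            allowed[address] = 1
    return [address for address, value in enumerate(memory)
            if not allowed[address] and value != dummy]


# Simulated device memory, filled with the dummy value up front.
memory = bytearray([DUMMY] * 32)
used = [(0, 8), (16, 8)]            # what the allocator handed out

memory[0:8] = bytes(range(8))       # legitimate writes inside used regions
memory[16:24] = bytes(range(8))
memory[10] = 0x01                   # stray write into unallocated memory

overwritten = find_overwrites(bytes(memory), used)

if __name__ == "__main__":
    print(overwritten)
```

Like the approach it models, this misses stray writes that land inside a used region or that happen to store the dummy value itself.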

Performance Tuning
Check the trace carefully and shorten the wall-clock time.
• We use Perfetto to record execution.
  ◦ Perfetto is a tracing tool for performance analysis.
• When we find a slow task, we speed it up.
• When we find unnecessary serial execution of tasks, we parallelize them.

Delivering
Finally, we package the MN-Core software stack and the application into a Docker image and deliver it to the application team.
• A Docker image ensures reproducibility.
• Docker images are widely used at PFN.
[Figure: the compiler team hands a Docker image to the application team]

Summary
• In the LLM era, it is increasingly important to satisfy compute performance, memory bandwidth, and power efficiency at the same time.
• To address this, PFN is designing MN-Core's dedicated hardware and software stack in an integrated manner.
• On the hardware side, MN-Core aims to achieve high efficiency, high bandwidth, and low power consumption through massive SIMD, vast local memory, and explicit data transfer.
• On the software side, PFN has built its own compiler, runtime, and tools to make MN-Core usable for real applications.
• PFN's strength lies in its ability to continuously improve both hardware and software through real-world operation.

Making the real world computable
