The Making of AI Chips

This presentation was given at a student discussion session at Kyoto University's Department of Communications and Information Systems, where Preferred Networks (PFN) introduced the MN-Core series of AI chips from both the hardware and the software perspective. It covers the challenges of AI computing (compute performance, memory bandwidth, and power efficiency), the design philosophy behind MN-Core, and the practical side of the compiler, runtime, and debugging work that connects Python to MN-Core.

Preferred Networks

May 21, 2026

Transcript

  1. The Making of AI Chips

    Yosuke Nakamura (Hardware) and Akira Kawata (Software). 2026-05-15, discussion session (談話会) at Kyoto University.
  2. Topics - The Hardware Side

    • Introduction / PFN’s AI initiatives
    • Environment and challenges surrounding AI chips
    • MN-Core design philosophy for AI chips
    • MN-Core technologies for AI chips
      ◦ Reduction operations and synchronization
      ◦ Bandwidth and register capacity
      ◦ Power control
      ◦ Integration with 3D-stacked DRAM
    • Summary
  3. Self Introduction: Yosuke Nakamura

    Career
    • Apr. 2025 to present: Preferred Networks (PFN)
    • 2008 to 2025: Fujitsu Limited
    • Ph.D., Information Science and Technology, The University of Tokyo
    Expertise
    • Advanced ASIC development for HPC systems, mainframes, and UNIX servers
    • End-to-end design: performance, hardware specs, architecture, circuits, verification, quality
    Focus
    • Bringing research into real-world systems (e.g., Fugaku)
  4. PFN: Vertically Integrating the AI Value Chain

    PFN develops the full AI technology stack in-house, from AI chips and computing infrastructure to generative AI foundation models and AI solutions. Our vertically integrated approach across these four layers enables us to solve complex, hard-to-address challenges.
    Figure layers:
    • AI products and solutions
    • Generative AI foundation models: large language model; PFP (model for simulating material energy)
    • Computing infrastructure: GPU cluster; MN-3 (MN-Core™ cluster); cloud-based computing service powered by MN-Core™ 2
    • AI chips: MN-Core; MN-Core 2; next-generation MN-Core L1000 (launch planned in 2027)
  5. MN-Core™ Series Roadmap

    Anticipating growing demand for semiconductors for AI computing, PFN started developing the first generation of the MN-Core™ series of AI processors in 2016. PFN is currently developing high-bandwidth chips specialized for generative AI. Details: https://projects.preferred.jp/mn-core/en/
    Timeline (2016, 2020, 2023, 2027):
    • MN-Core (TSMC 12nm): AI training, AI inference, HPC. Development: 2016-. Internal use: 2020-. Compute for external use: 2023-.
    • MN-Core 2 (TSMC 7nm): AI training, AI inference, HPC. Test operation: 2023-. External use of servers/compute via PFCP™: 2024-.
    • MN-Core L1000: AI inference. Development: 2024-. Planned launch: 2027.
    • MN-Core L2000: massive AI inference, HPC. Development: 2025-. Planned launch: 2027.
    • Next generation: AI training, massive AI inference, HPC.
    Legend: in development; in discussion.
  6. Energy-Efficient Computing Infrastructure

    PFN pursues highly energy-efficient computing infrastructure. Powered by PFN’s own AI chip MN-Core™ (first generation), MN-3 has topped the Green500 ranking of the world’s most energy-efficient supercomputers three times: June 2020, June 2021, and November 2021. Details: https://projects.preferred.jp/en/supercomputers/
  7. Growth of Computational Power

    • Conventional era: computational power grew 4× every 3 years, driven by Moore’s Law.
    • Deep learning era: the compute required for state-of-the-art AI models doubles every 5.5 months (~4.6× per year).
    → AI demands an unprecedented scale of computation.
    Source: Epoch AI
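
The two growth rates are easier to compare on a per-year basis. The short calculation below is an illustrative sketch that uses only the figures quoted on the slide (4× every 3 years, doubling every 5.5 months); the function name is hypothetical.

```python
def annual_growth(factor: float, period_months: float) -> float:
    """Convert 'grows by `factor` every `period_months`' into a per-year multiplier."""
    return factor ** (12.0 / period_months)


conventional = annual_growth(4.0, 36.0)   # 4x every 3 years
deep_learning = annual_growth(2.0, 5.5)   # doubling every 5.5 months

print(f"conventional era:  {conventional:.2f}x per year")
print(f"deep learning era: {deep_learning:.2f}x per year")
```

A 5.5-month doubling time works out to roughly 4.5× per year, close to the slide's rounded figure of about 4.6×, against roughly 1.6× per year for the conventional era.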
  8. Recent Model: Transformer’s Design Philosophy

    Key property
    • Highly parallelizable architecture (scales well on GPUs)
    What was optimized in R&D
    • Maximizing training throughput (large-scale training prioritized)
    Resulting trade-off
    • Inference efficiency is not fully optimized
    • High compute and memory bandwidth are required in production
    Timeline: 2012: AlexNet; 2021: Scaling Law; 2023: ChatGPT. Eras: R&D era, scaling era, production era.
  9. Inference Bottleneck

    Each token generation requires reading the model (weights) from memory.
    → Model size ∝ memory access volume (for a 70B model: ~70GB in FP8, ~140GB in FP16)
    Bandwidth becomes the bottleneck, directly impacting UX.
    Recent generative AI: the number of output tokens is increasing due to reasoning and thinking processes. “s1: Simple test-time scaling” (Muennighoff et al.), arXiv:2501.19393, 2025.
    Figure: input and output tokens with the KV cache at each generation step.
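
The weight sizes quoted above, and why bandwidth caps generation speed, can be checked with a few lines of arithmetic. This is a rough sketch: the 1 TB/s bandwidth is an assumed figure for illustration, not a specification of any chip, and the model ignores KV-cache traffic and batching.

```python
def weight_bytes(num_params: float, bytes_per_param: float) -> float:
    """Total bytes occupied by the model weights."""
    return num_params * bytes_per_param


def max_tokens_per_second(num_params: float, bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound when every generated token must stream all weights once."""
    return bandwidth_bytes_per_s / weight_bytes(num_params, bytes_per_param)


PARAMS_70B = 70e9
HYPOTHETICAL_BANDWIDTH = 1e12  # 1 TB/s, assumed for illustration only

for name, nbytes in (("FP8", 1), ("FP16", 2)):
    size_gb = weight_bytes(PARAMS_70B, nbytes) / 1e9
    tps = max_tokens_per_second(PARAMS_70B, nbytes, HYPOTHETICAL_BANDWIDTH)
    print(f"{name}: {size_gb:.0f} GB of weights, at most {tps:.1f} tokens/s per sequence")
```

Under these assumptions, halving the bytes per parameter doubles the bound, which is one reason low-precision formats and higher memory bandwidth both matter for inference.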
  10. Power Challenges: AI Scaling Is Hitting Power Limits

    • Flagship AI processors now exceed 1 kW TDP (e.g., GB200 ~2.7 kW, Falcon Shores ~1.5 kW)
    • Rack-level power surpasses 100 kVA (e.g., DGX GB200 NVL72 ~120 kVA)
    • Moore’s Law scaling is slowing
    → Systems are approaching facility limits
    → Performance per watt is now the key driver of scaling
    Source: NVIDIA Blackwell Architecture Technical Brief
  11. Energy-efficient AI Processor MN-Core™ Series: Concept, Design Principles & Targets

    • Proprietary design philosophy
    • Feedback from the AI workload research team
    • Simple, high performance, high bandwidth, low power
  12. MN-Core™ Series: Design Philosophy

    The MN-Core architecture maximizes the proportion of arithmetic units on the hardware by moving functions normally handled by hardware to the software side, achieving high performance and energy efficiency.
    • General-purpose processor
      ◦ Software: optimization, code generation
      ◦ Hardware: command scheduler, cache controller, network control circuit, register, arithmetic units, DRAM I/F, on-chip memory, on-chip network
    • MN-Core series
      ◦ Software: optimization, code generation, command scheduler, cache controller, network control
      ◦ Hardware: register, arithmetic units, DRAM I/F, on-chip memory, on-chip network
    Details: https://projects.preferred.jp/en/mn-core/
  13. What’s Important for a ‘High Compute Efficiency x Low Power’ AI Chip

    AI compute has a different profile from existing general-purpose workloads.
    • Conventional workloads: tailor-made programs that use conditional branches and loops, so the procedure is dynamic.
      ◦ Existing general-purpose processors dedicate only a very limited area to floating-point units; the remaining area supports the “dynamic procedure”, for example the OoO scheduler and the cache controller.
    • AI compute: AI models are compute graphs built from standardized operations, so the procedure is static.
      ◦ This static nature leaves room for architectural innovation.
    → Innovation in hardware architecture, plus global optimization in software to maximize its value.
  14. Ultra-wide SIMD

    Matrix MAC operation: A x B + C. A single instruction drives 1024 matrix units simultaneously.
    With 16-bit floats:
    • 512 FLOP/cycle per MAB
    • 524,288 FLOP/cycle in total per die (512 x 16 x 8 x 8)
    Hierarchically tiled arithmetic blocks (MN-Core 2 structure, from the Hot Chips 2024 presentation):
    • L2B: Level-2 Broadcasting Block
    • L1B: Level-1 Broadcasting Block
    • MAB: Matrix Arithmetic Block
    • MAU: Matrix Arithmetic Unit
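
The per-die figure can be cross-checked by reading the slide's product as 512 FLOP/cycle per MAB multiplied by 16 x 8 x 8 = 1024 MABs. That reading is an interpretation (the slide does not label which factor belongs to which hierarchy level), and the variable names below are illustrative.

```python
FLOP_PER_MAB_PER_CYCLE = 512   # 16-bit float, first factor of the slide's product
FAN_OUT = (16, 8, 8)           # remaining factors; level assignment is not stated

mabs_per_die = 1
for factor in FAN_OUT:
    mabs_per_die *= factor

flop_per_cycle_per_die = FLOP_PER_MAB_PER_CYCLE * mabs_per_die

print(f"MABs per die:       {mabs_per_die}")
print(f"FLOP/cycle per die: {flop_per_cycle_per_die:,}")
```

The product gives 1024 units and 524,288 FLOP/cycle, matching both numbers on the slide.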
  15. Vast Local Memory

    Each PE has its own very large local memory and register file (not a shared memory!).
    • Two GRFs (256 64-bit words each), 1R1W
    • Two local memories (2048 64-bit words each), 1RW
  16. Overview of Controlling MN-Core

    Diagram labels: Matrix Array Unit (MAU), Arithmetic Logic Unit (ALU), DRAM, host memory, MN-Core DirectConnect.
  17. Explicit Data Transfer by Cacheless Architecture

    • Because there is no data cache, software explicitly specifies when, from where, and to where data is transferred.
    • Explicit data transfer lets software perform a broadcast or a reduction as part of the same transfer.
    Figure: DRAM, L2BM, L1BM, and PE hierarchy, shown next to a cache-based hierarchy.
  18. ① Efficiency Improvements with Reduction Operations

    Compute units do not sit idle for reduction operations. (Feature: Explicit Data Transfer)
    • In a convolution, for example, a sum is taken along the channel dimension.
    • On a GPU, each compute unit (SM, etc.) computes the convolution for one channel, or a partial sum over several channels, and then one compute unit takes the total sum through the L2 cache.
    • On MN-Core, the total sum can be taken while the results move from one level of the hierarchy to the next.
    • Data-transfer time is used to perform the reduction, eliminating separate reduction steps.
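
The idea of folding the reduction into the data movement can be modelled in a few lines. This is a toy software model, not MN-Core code: the hierarchy sizes (2 blocks of 4 PEs) and all names are assumptions chosen for illustration.

```python
from typing import List


def reduce_on_transfer(children: List[List[float]]) -> List[float]:
    """Model one hop up the hierarchy: the vectors arriving from the child
    blocks are summed element-wise as part of the transfer itself."""
    return [sum(values) for values in zip(*children)]


# Partial sums held by each PE, grouped by the (hypothetical) L1 block they belong to.
pe_partials = [
    [[1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [2.0, 0.0, 1.0]],
    [[0.0, 1.0, 0.0], [3.0, 3.0, 3.0], [1.5, 0.5, 2.5], [1.0, 1.0, 1.0]],
]

l1_sums = [reduce_on_transfer(pes) for pes in pe_partials]  # PE -> L1 block
total = reduce_on_transfer(l1_sums)                         # L1 block -> L2 block

print(total)
```

No separate reduction pass over the data appears in this model: every sum happens at a hop the data had to make anyway.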
  19. ② Easy Synchronization, with Less Waiting Time

    No software barriers are required for synchronization, because all computation completes within a few clock cycles of each other. (Feature: Ultra-wide SIMD)
    • No explicit synchronization is needed during execution.
    • This is because all MABs follow a single instruction stream and execute in lockstep.
    • In addition, memory access, communication, and synchronization are statically scheduled at compile time.
    • As a result, no dynamic synchronization or waiting is required, which significantly reduces idle time.
    → MN-Core achieves high compute efficiency with minimal effort.
  20. ③ Efficient Bandwidth Utilization

    Bandwidth is primarily used for actual data movement. (Feature: Explicit Data Transfer)
    • In conventional architectures, part of the bandwidth is used for cache coherency and dynamic data movement.
    • Modern multi-die systems need large interconnect bandwidth to sustain high performance.
    • In MN-Core, data placement is explicitly managed by software, so unnecessary data movement is minimized.
    → Bandwidth can be used more directly for computation.
    → Higher bandwidth utilization and hardware efficiency.
  21. ④ Large Local Capacity per Compute Unit

    “Vast Local Memory” keeps more data close to computation.
    Comparison per matrix unit:
    • RTX5090 Tensor Core: FP16 performance 616 GFLOPS (derived from 419 TFLOPS / 680 Tensor Cores); register capacity 63.8KB (32 bits x 32 threads x 255 registers/thread)
    • MN-Core2 MAB: FP16 performance 384 GFLOPS; register capacity 144.0KB (4 PEs x {GRF 4KB + LM 32KB})
    Sources: https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf and https://docs.nvidia.com/cuda/blackwell-tuning-guide/index.html
    • MN-Core provides large local storage (register file + local memory) per compute unit.
    • This allows more data to stay on-chip during kernel execution.
    • It improves data reuse and reduces repeated memory access.
    → More efficient kernel execution, especially for loop-based workloads.
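
The MN-Core 2 capacity and the per-Tensor-Core throughput in the comparison can be re-derived from numbers given in the slides (GRF and local-memory word counts from the "Vast Local Memory" slide, 419 TFLOPS over 680 Tensor Cores from this one). The sketch below only repeats that arithmetic; the capacity-per-GFLOPS ratio at the end is an added illustration, not a figure from the talk.

```python
WORD_BYTES = 8  # 64-bit words

grf_bytes = 2 * 256 * WORD_BYTES    # two GRFs, 256 words each
lm_bytes = 2 * 2048 * WORD_BYTES    # two local memories, 2048 words each
PES_PER_MAB = 4

mab_capacity_kb = PES_PER_MAB * (grf_bytes + lm_bytes) / 1024
tensor_core_gflops = 419e12 / 680 / 1e9

# Local capacity available per unit of FP16 throughput (KB per GFLOPS).
mn_core2_ratio = mab_capacity_kb / 384
rtx5090_ratio = 63.8 / tensor_core_gflops

print(f"MN-Core 2 MAB capacity: {mab_capacity_kb:.1f} KB")
print(f"RTX 5090 per Tensor Core: {tensor_core_gflops:.0f} GFLOPS")
print(f"KB per GFLOPS: MN-Core 2 {mn_core2_ratio:.3f}, RTX 5090 {rtx5090_ratio:.3f}")
```

With the slide's numbers, a MAB holds roughly 3.6 times as much local data per GFLOPS as a Tensor Core, which is the quantitative core of the "vast local memory" claim.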
  22. ⑤ Predictable Power Consumption for Stable Operation

    The deterministic nature of the wide SIMD units and the explicit data transfer allows software to predict and control the power profile. (Features: Ultra-wide SIMD, Explicit Data Transfer)
    • In recent chips, power consumption keeps growing (on the order of 1.x kW), while supply voltages stay at around 0.7 V.
      ◦ The supply current therefore grows to around 2 kA: i(t) = P(t) / V
    • The difficulty is that the current varies widely with the compute load.
    → The supply voltage can vary widely as well.
      ◦ Parasitic resistance and inductance matter: v(t) = V0 - R*i(t) - L*di(t)/dt
    Circuit diagram labels: source V0, series R and L, current i(t), on-die voltage v(t) of about 0.7 V at MN-Core.
  23. ⑤ Predictable Power Consumption for Stable Operation (continued)

    The deterministic nature of the wide SIMD units and the explicit data transfer allows software to predict and control the power profile. (Features: Ultra-wide SIMD, Explicit Data Transfer)
    • A sudden change in current causes a voltage swing, which can lead to a malfunction if the voltage drops below “Vmin”. Changing the compute load gradually is therefore key to stable operation.
    • Although MN-Core's wide SIMD operation can cause sudden changes, its deterministic nature makes them predictable. Software can use control knobs such as clock frequency and nop insertion to suppress sudden changes.
    • In other words, MN-Core does not need unnecessary guard bands and can deliver higher performance while maintaining stable operation.
      ◦ Users do not have to care about this control.
    Figure: current waveform i(t) and voltage waveform v(t); a huge current from a sudden compute load pulls the voltage from V0 below Vmin (malfunction).
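
The effect of ramping the load gradually can be illustrated with a discrete version of the supply model, using the usual sign convention in which a rising current lowers the on-die voltage: v(t) = V0 - R*i(t) - L*di/dt. All component values and the Vmin threshold below are made-up illustration values, not MN-Core parameters.

```python
def min_voltage(current, dt, v0=0.75, r=2e-5, l=1e-12):
    """Lowest v = V0 - R*i - L*di/dt over a sampled current waveform (amps)."""
    lowest = v0 - r * current[0]
    for prev, cur in zip(current, current[1:]):
        v = v0 - r * cur - l * (cur - prev) / dt
        lowest = min(lowest, v)
    return lowest


DT = 1e-8     # 10 ns per sample (assumed)
V_MIN = 0.65  # hypothetical minimum operating voltage

# Same 1 kA -> 2 kA load change, applied at once or spread over 10 samples.
step = [1000.0] * 5 + [2000.0] * 15
ramp = [1000.0] * 5 + [1000.0 + 100.0 * k for k in range(1, 11)] + [2000.0] * 5

for name, waveform in (("step", step), ("ramp", ramp)):
    v = min_voltage(waveform, DT)
    status = "below Vmin" if v < V_MIN else "ok"
    print(f"{name}: minimum voltage {v:.3f} V ({status})")
```

In this toy model the step dips to about 0.61 V while the ramp stays at about 0.70 V, even though both end at the same steady-state load. Spreading a known load change over time, for example by inserting nops, has the same effect as the ramp.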
  24. MN-Core L1000: LLM Inference Accelerator (Under Development)

    Vertical integration of 3D DRAM and a logic wafer, giving outstanding memory bandwidth and large capacity.
    • High density, practical capacity for LLMs: 3D DRAM achieves capacity equivalent to HBM.
    • Ultra-high-speed vertical connection: exceeds HBM bandwidth.
    • Reduced inter-wafer distance: achieves communication speeds equivalent to SRAM.
    Figure: 3D DRAM stacked on MN-Core.
  25. MN-Core L1000 and GPU Memory Architectures

    • GPU: shared memory (L2) with a coherence protocol across the whole chip. Figure: processing units, each with an L1 cache, sharing an L2 cache and memory.
    • MN-Core L1000: distributed memory with local access to local memory, which gives high compatibility with 3D DRAM. Figure: processing units paired with 3D stacked DRAM.
      ◦ Distributed memory architecture at the PE level: pairs of PE and 3D DRAM in a tile pattern
      ◦ Low power consumption and deterministic power control
  26. Memory Technology: MN-Core L Series Architecture

    • Logic + HBM (High Bandwidth Memory): NVIDIA, SambaNova, Google, AWS, Intel, AMD, etc. 👍 Speed, 👍 Capacity
    • Logic + SRAM: Groq, Cerebras. 👍👍👍 Speed, 😐 Capacity
    • Logic + 3D stacked DRAM (MN-Core L series): 👍👍👍 Speed, 👍 Capacity
    Fully distributed memory architecture: 3D stacked DRAM with processing units in an in-chip network. ① Short data moves ② Data stays near memory
  27. Summary of the Hardware Side

    • LLMs are reshaping computing requirements.
      ◦ Models are growing rapidly: larger, deeper, and more complex.
    • PFN provides end-to-end support.
      ◦ From software to hardware.
      ◦ Key hardware bottlenecks: compute, memory bandwidth, and power.
    • PFN is developing MN-Core, a dedicated AI chip.
    • Key features address these challenges:
      ◦ Massive SIMD for parallel computation.
      ◦ Large local memory for efficient bandwidth use.
      ◦ Explicit data transfer to reduce synchronization.
      ◦ Simplified SIMD for better power efficiency.
      ◦ 3D DRAM for higher bandwidth and performance.
  28. Self Introduction: Akira Kawata

    Short history
    • Mar. 2020: Received M.S. in Informatics, Kyoto University
    • Apr. 2020: Joined Preferred Networks (PFN)
    Current work
    • Compiler engineer, Preferred Networks
    • Works on the MN-Core software stack
    • Focuses on the runtime system, including the Python interface, C++ middleware, and device drivers
    Interests
    • Binary hacks
    • https://akawashiro.com/
  29. Today’s Topics

    • How we use MN-Core with the software stack today
    • How the MN-Core software stack works
    • How we have developed the MN-Core software stack
  30. We Are Using MN-Core

    MN-Core is already used in production:
    • AI-based material search (Matlantis)
    • Image recognition (Kachaka)
    • 3D model generation (PFN 3D-Scan)
    • LLM inference (example)
  31. You Can Use MN-Core Now

    • You can make your own LLM using MN-Core
    • https://playground.mn-core.com/slm-customize
  32. You Can Write Your Own Program on MN-Core

    • The SDK is available
    • https://dev.mn-core.com/sdk/0.4/MLSDK/docs/en/
    Figure: documentation (EN/JP) and an image generation example.
  33. Overview of MN-Core Software Stack

    Diagram labels: Graph Extraction; Graph Lowering; Layout and Location Planning; Instruction Emission; MN-Core Runtime; MNCL Host Runtime; MNCL Device Translator; MNACC.
    (“I will focus on this part today” marks part of the diagram.)
  34. Example

        import torch
        from mlsdk import compile

        # 1. Define a function to compile.
        def double(input):
            x = input["x"]
            return {"out": x * 2}

        # 2. Compile the function, passing a sample input.
        sample = {"x": torch.zeros(3, 4)}
        compiled_double = compile(
            double,
            sample,
        )

        # 3. Run the compiled function.
        result = compiled_double(
            {"x": torch.ones(3, 4)})
  37. Compiler Overview

    Pipeline: Python source → ONNX → High-level IR → Low-level IR → Machine instructions → Binary file
    • Python source: def double(input): x = input["x"]; return {"out": x * 2}
    • Machine instructions: lpassa $lm0v $ln0v, lpassa $lm8v $ln8v, lpassa $lm16v $ln16v, lpassa $lm24v $ln24v, …
    • Binary file: metadata plus machine instructions
  38. Python to ONNX

    Convert a Python function into an ONNX graph by extracting the computation graph.
    • ONNX is a graph format widely used in machine learning.
    • An edge in ONNX corresponds to an array.
    • A node in ONNX corresponds to an operation on arrays in Python.
    Example input: def double(input): x = input["x"]; return {"out": x * 2}
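
Graph extraction can be pictured as running the Python function on placeholder objects that record operations instead of computing them. The tracer below is a minimal sketch of that general technique, written for this transcript. It is not how mlsdk works internally (the slides do not say), and it produces a plain list of nodes, not a real ONNX file. `Traced` and `extract_graph` are hypothetical names.

```python
class Traced:
    """Stand-in for an array (a graph edge); records operations instead of computing."""

    def __init__(self, name, graph):
        self.name = name
        self.graph = graph

    def __mul__(self, other):
        out = Traced(f"t{len(self.graph)}", self.graph)
        operand = other.name if isinstance(other, Traced) else repr(other)
        # One node per operation: (op type, input edge names, output edge name).
        self.graph.append(("Mul", [self.name, operand], out.name))
        return out


def extract_graph(fn, input_names):
    """Call `fn` on placeholders and return the recorded nodes and output edges."""
    graph = []
    outputs = fn({name: Traced(name, graph) for name in input_names})
    return graph, {key: value.name for key, value in outputs.items()}


def double(input):
    x = input["x"]
    return {"out": x * 2}


nodes, outputs = extract_graph(double, ["x"])
print(nodes)
print(outputs)
```

The single recorded `Mul` node, with edge `x` as input and the edge named by `outputs["out"]` as output, mirrors the slide's description: nodes are operations and edges are arrays.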
  39. ONNX to High-Level IR Graph

    Decide the location and layout of each ONNX edge.
    • A high-level IR graph is an extended version of ONNX.
    • Location describes where each array is placed.
      ◦ MN-Core has fast local memory and large but slower DRAM.
      ◦ Because MN-Core has no caches, we must schedule memory transfers explicitly.
    • Layout describes how arrays are mapped onto MN-Core's tree structure.
      ◦ See our blog [1] for details of layout.
    [1]: https://tech.preferred.jp/ja/blog/mn-core-tensor-layout/
  40. High-Level IR Graph to Low-Level IR Graph

    Convert each node in the high-level IR graph into a low-level IR graph.
    • A low-level IR graph is an instruction-level intermediate representation.
    • Each node in the high-level IR graph (left in the figure) is translated into its own low-level IR graph (right in the figure).
    • This process therefore generates multiple low-level IR graphs from one high-level IR graph.
  41. From Low-Level IR Graph to Machine Instructions

    Generate a sequence of machine instructions from a low-level IR graph.
    • The compiler tries to emit as short an instruction sequence as possible for the given low-level IR graph.
    • Because MN-Core has no jump or branch instructions, a shorter instruction sequence leads directly to better performance.
    Example output: lpassa $lm0v $ln0v, lpassa $lm8v $ln8v, lpassa $lm16v $ln16v, lpassa $lm24v $ln24v, …
  42. Binary File Packing

    Instruction sequences are packed into one binary file with some meta information.
    • After all low-level IR graphs have been translated, the resulting instruction sequences are concatenated and packed into one binary file.
    • A binary file contains:
      ◦ Instruction sequences
      ◦ Input/output information
      ◦ Relocation information
    Figure: several machine-instruction sequences (lpassa $lm0v $ln0v, lpassa $lm8v $ln8v, …) combined with metadata into one binary file.
  44. What the Runtime Does

    The runtime controls MN-Core from the host computer.
    • Execution of compiled functions
    • Data transfer between the host computer and MN-Core
    • MN-Core memory management
  45. Program Execution

    Run a compiled binary on MN-Core.
    • Fix up the instructions in a binary file
      ◦ so that the addresses match the given inputs and outputs (addresses are rewritten on the fly).
      ◦ See my earlier talk [1] on relocation in the MN-Core compiler/runtime.
    • Send the relocated instructions to MN-Core.
    Relocated instructions: lpassa $lm0v $ln256v, lpassa $lm8v $ln264v, lpassa $lm16v $ln272v, lpassa $lm24v $ln280v, …
    [1]: https://speakerdeck.com/pfn/20241213_pfn_camphor
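
The address fix-up can be sketched on the textual form of the instructions shown in the slides. This is an illustration only: the real runtime presumably patches binary encodings using the relocation information in the file, and the assumption that only the `$ln…v` operand is relocated is inferred from the before/after listings on the slides. `relocate` is a hypothetical name.

```python
import re

LN_OPERAND = re.compile(r"\$ln(\d+)v")


def relocate(instructions, base):
    """Add `base` to every $ln<addr>v operand, leaving other operands untouched."""
    return [
        LN_OPERAND.sub(lambda m: f"$ln{int(m.group(1)) + base}v", instruction)
        for instruction in instructions
    ]


binary_instructions = [
    "lpassa $lm0v $ln0v",
    "lpassa $lm8v $ln8v",
    "lpassa $lm16v $ln16v",
    "lpassa $lm24v $ln24v",
]

relocated = relocate(binary_instructions, base=256)
for instruction in relocated:
    print(instruction)
```

With a base of 256 this reproduces the relocated listing on the slide. Because the offset is applied at load time, the same compiled binary can run against whatever buffers it is given.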
  46. Data Transfer Between Host and Device

    Transfer arrays between the host computer and MN-Core.
    • Convert data between a Python array and the MN-Core data format.
      ◦ Data reordering
      ◦ Floating-point conversion
        ▪ For example: IEEE 754 32-bit float ⇔ MN-Core 2 block float
    • Transfer arrays between the host computer and MN-Core.
      ◦ Split the data into chunks that fit the DMA unit
      ◦ Control Direct Memory Access (DMA)
    Figure: Python array [0, 1, 2, …] converted to the MN-Core data format (01010101...) and sent to MN-Core.
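
The chunking step can be sketched as follows. The 20-byte limit and the function name are hypothetical, the real DMA transfer size is not given in the slides, and the MN-Core-specific reordering and block-float conversion are left out.

```python
import struct


def split_for_dma(payload: bytes, max_chunk: int):
    """Yield (offset, chunk) pairs, each chunk at most `max_chunk` bytes long."""
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    for offset in range(0, len(payload), max_chunk):
        yield offset, payload[offset:offset + max_chunk]


# A 3x4 array of 32-bit floats, flattened and serialized as little-endian bytes.
values = [float(i) for i in range(12)]
payload = struct.pack("<12f", *values)

chunks = list(split_for_dma(payload, max_chunk=20))
for offset, chunk in chunks:
    print(f"transfer {len(chunk)} bytes to device offset {offset}")
```

Keeping the offset with each chunk lets every transfer be issued independently, and the last chunk is simply shorter when the payload is not a multiple of the chunk size.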
  47. Device Memory Management

    Manage MN-Core DRAM memory.
    • An array object in Python corresponds to a region in device memory.
      ◦ For example, result in the code below corresponds to a region in device memory.
    • That device memory is freed automatically when the corresponding object is destroyed.

        compiled_double = compile(
            double,
            sample,
        )
        result = compiled_double(
            {"x": torch.ones(3, 4)})
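
Tying a device allocation to the lifetime of a Python object can be done with `weakref.finalize` from the standard library. The allocator and array classes below are hypothetical stand-ins written for this sketch; the slides do not describe how the MN-Core runtime implements this.

```python
import gc
import weakref


class DeviceAllocator:
    """Toy bump allocator that tracks live regions as {address: size}."""

    def __init__(self):
        self.live = {}
        self._next = 0

    def alloc(self, nbytes):
        addr = self._next
        self._next += nbytes
        self.live[addr] = nbytes
        return addr

    def free(self, addr):
        del self.live[addr]


class DeviceArray:
    """Python-side handle for one region of device memory."""

    def __init__(self, allocator, nbytes):
        self.addr = allocator.alloc(nbytes)
        # The callback must not reference `self`, or the object would never die.
        self._finalizer = weakref.finalize(self, allocator.free, self.addr)


allocator = DeviceAllocator()
result = DeviceArray(allocator, 48)
live_before = dict(allocator.live)

del result      # drop the last reference to the handle
gc.collect()    # only needed on interpreters without reference counting
live_after = dict(allocator.live)

print(live_before, live_after)
```

Compared with `__del__`, `weakref.finalize` also runs at interpreter exit and guarantees the callback is called at most once.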
  48. How We Have Developed the MN-Core Software Stack

    • We have continuously improved our software stack by using it internally.
    • We have ported many applications to MN-Core ourselves.
      ◦ ResNet50
      ◦ Network Architecture Search
      ◦ LLM training, etc.
    • Porting an application consists of the following steps:
      ◦ Rewrite Python code
      ◦ Emit a valid high-level IR graph
      ◦ Implement missing operators
      ◦ Debug
      ◦ Performance tuning
      ◦ Deliver the result
  49. Rewriting Python to Fit Our Software Stack

    • Machine learning applications use many libraries.
    • We need to peel them back and re-implement the core part.
    • Sometimes we also need to change the algorithm because of hardware characteristics.

    Before:

        import torch

        def double(input):
            x = input["x"]
            return {"out": x * 2}

        result = double({"x": torch.ones(3, 4)})

    After:

        import torch
        from mlsdk import compile

        def double(input):
            x = input["x"]
            return {"out": x * 2}

        sample = {"x": torch.zeros(3, 4)}
        compiled_double = compile(
            double,
            sample,
        )
        result = compiled_double(
            {"x": torch.ones(3, 4)})
  50. Emit a Valid High-Level IR Graph

    Sometimes the compiler fails to find a valid schedule and cannot emit a high-level IR graph. In such cases, we:
    • Fix the compiler itself.
    • Rewrite the Python script to make it more compiler-friendly.
      ◦ For example, our compiler handles arrays whose shapes are powers of two more easily.
    • Add compiler hints.

        # Example of compiler hints
        {
          "trigger": "Mul",
          "layout": "((3:4), (4:1))"
        }
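
One way to make a script friendlier to a compiler that prefers power-of-two shapes is to pad each dimension up to the next power of two. The helper below is a generic sketch with hypothetical names; the slides do not say that PFN uses padding, and whether padding is acceptable depends on the model.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is greater than or equal to n (n >= 1)."""
    if n < 1:
        raise ValueError("dimension must be positive")
    return 1 << (n - 1).bit_length()


def padded_shape(shape):
    """Round every dimension of `shape` up to a power of two."""
    return tuple(next_power_of_two(dim) for dim in shape)


print(padded_shape((3, 4)))
print(padded_shape((5, 100, 64)))
```

For the 3x4 array used in the running example this yields a 4x4 array, at the cost of computing on one padded row whose results are discarded.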
  51. Fixing Operator Implementations

    • Once a valid high-level IR graph has been emitted, we need to compile every operator in it.
    • We make a big spreadsheet, divide the failing operators among teams, and implement them.
    Figure: zoomed-in view of the spreadsheet.
  52. Debugging

    • In many cases, we need to debug after the compiler successfully emits a binary file.
    • Debugging tends to be hard because:
      ◦ We build the entire software and hardware stack ourselves.
      ◦ Machine learning applications often appear to work even when the stack contains bugs, which makes the bugs harder to find.
      ◦ Machine learning applications often take a very long time to run. Sometimes it takes a whole day just to reproduce a single bug.
    • We debug by checking the most likely causes one by one, inspecting logs, machine instructions, and binary files very carefully.
  53. Debug Example: Stuck in Software Initialization

    • Problem:
      ◦ Software initialization does not complete.
    • Possible causes:
      ◦ Software deadlock
      ◦ Hardware FIFO overflow
      ◦ Failure during hardware initialization
        ▪ We checked hardware registers to find where execution gets stuck.
    Figure: HW modules A to E connected in a pipeline; one module does not receive any data.
  54. Debug Example: When the Loss Doesn’t Change

    • Problem:
      ◦ Training loss doesn’t change.
    • Possible causes:
      ◦ Bug in an operator implementation
      ◦ Memory address mismatch
      ◦ Missing metadata in the binary file
    Table: the loss is 9.13 at every iteration from 1 to 10.
  55. Debug Example: When the Trained Model Performs Poorly

    • Problem:
      ◦ The loss curves of MN-Core and GPU do not match well.
      ◦ A model trained on MN-Core performs worse than the same model trained on a GPU.
    • Possible causes:
      ◦ Bug in the training script
      ◦ Insufficient floating-point precision
  56. Building a Debugger

    We sometimes build debuggers to find a bug.
    • Some complex bugs cannot be resolved just by checking logs, binary files, and source code.
    • We built a memory-overwrite detector.
      ◦ Dump the allocator's memory map
      ◦ Dump the hardware memory map
        ▪ Fill the memory with a dummy value initially and read the whole area after running the application.
      ◦ Compare them to find mismatches
    • The idea came from the mechanism of AddressSanitizer. See [1].
    Figure: the allocator's memory map (used regions) next to the dumped memory map; a region outside the used ones was overwritten by mistake.
    [1]: Binary Hacks Rebooted - O'Reilly Japan #47
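
The comparison step of such a detector can be sketched in a few lines: any byte that lies outside every region the allocator handed out, yet no longer holds the dummy fill value, was written by someone who should not have touched it. The fill value, memory size, and names below are assumptions for illustration, not details of PFN's tool.

```python
DUMMY = 0xAA  # fill value written to all of memory before the run (assumed)


def find_overwrites(memory: bytes, allocated):
    """Return offsets outside every allocated [start, end) range that are not DUMMY."""
    owned = bytearray(len(memory))
    for start, end in allocated:
        for i in range(start, end):
            owned[i] = 1
    return [i for i, byte in enumerate(memory) if not owned[i] and byte != DUMMY]


# Simulated run: fill memory, perform two legitimate writes and one stray write.
memory = bytearray([DUMMY] * 32)
allocated = [(0, 8), (16, 24)]   # the allocator's view of used regions
memory[0:8] = b"\x01" * 8        # write inside the first allocation
memory[16:24] = b"\x02" * 8      # write inside the second allocation
memory[9] = 0x7F                 # stray write just past the first allocation

suspects = find_overwrites(bytes(memory), allocated)
print(suspects)
```

A stray write that happens to store the dummy value itself goes undetected, so an unusual fill pattern is preferable.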
  57. Performance Tuning

    Check the trace carefully and shorten the wall-clock time.
    • We use Perfetto to record execution.
      ◦ Perfetto is a tracing tool for performance analysis.
    • When we find a slow task, we speed it up.
    • When we find tasks that run serially without needing to, we parallelize them.
  58. Delivering

    Finally, we package the MN-Core software stack and the application into a Docker image and deliver it to the application team.
    • A Docker image ensures reproducibility.
    • Docker images are widely used at PFN.
    Figure: compiler team → Docker image → application team.
  59. Summary

    • In the LLM era, it is increasingly important to satisfy compute performance, memory bandwidth, and power efficiency at the same time.
    • To address this, PFN designs MN-Core’s dedicated hardware and software stack in an integrated manner.
    • On the hardware side, MN-Core aims for high efficiency, high bandwidth, and low power consumption through massive SIMD, vast local memory, and explicit data transfer.
    • On the software side, PFN has built its own compiler, runtime, and tools to make MN-Core usable for real applications.
    • PFN’s strength lies in its ability to continuously improve both hardware and software through real-world operation.