Tenstorrent HW/SW Overview

Session 1 of Tenstorrent Tech Talk #1


Tenstorrent Japan

June 13, 2025

Transcript

  1. Tenstorrent TechTalk #1 • May 2025 • Tenstorrent Japan KK
     Tenstorrent HW/SW Overview • FAE Manager 伊藤康宏 • June 2025 • Tenstorrent Japan KK
  2. Tenstorrent AI Strategy
     • Tenstorrent's vision for AI
       • AI Everywhere: affordable, high-performance products that anyone can use.
       • mW to MW: the same architecture and software scale from small devices to large systems.
     • Open strategy
       • CPUs: x86 (the industry standard) and RISC-V (an open ISA)
       • Open-source SDK, developed in collaboration with the developer community
       • Actively promoting standardization of chiplets as well
     • Available in a variety of form factors
       • IP, chip, chiplet, PCIe card, server, data center
     • To accelerate AI adoption and DX across Japanese industry and companies,
       contribute to developing the engineers who drive AI and DX
     • Growth strategy
       • 2025: grow the developer base; expand sales and support together with distributors
       • 2026: full-scale entry into the data center business
     (Slide art: "Any AI model", "Open source", "Optimized ML results", "Custom ops", "Build anything", compiler diagram)
  3. Core Silicon Roadmap
     • Grayskull® (GEN 1, AI Processor) • 2021 tapeout, 2023 product, 2025 EOL
       • 120 Tensix Cores • 12nm • 332 TFLOPS (FP8) • 83 TFLOPS (BLOCKFP8)
       • 16 lanes of PCIe Gen 4.0 • 8 channels LPDDR4
     • Wormhole (GEN 2, Networked AI Processor) • 2022 tapeout, 2024 product • Now available
       • 80 Tensix+ Cores • 12nm • 292 TFLOPS (FP8) • 164 TFLOPS (BLOCKFP8)
       • 16 lanes of PCIe Gen 4.0 • 16x 100 Gbps Ethernet • 6 channels GDDR6
     • Blackhole (Standalone AI Computer) • 2023 tapeout, 2025 product • Now available
       • 140 Tensix++ Cores • 6nm • 774 TFLOPS (FP8) • 387 TFLOPS (BLOCKFP8)
       • 12x 400 Gbps Ethernet • 48 lanes of SerDes • 8 channels of GDDR6
       • 16 "Big RISC-V" CPU cores
     • Quasar (GEN 3, Low Power AI Chiplet) • 2025 tapeout, 2026 product
       • 32 Tensix NEO Cores • 4nm chiplet • Features incl. SMC with self-boot/reset
       • Non-blocking D2D interfaces • Easily stack Quasar or combine to choose your own compute
     • Athena (High Performance RISC-V CPU Chiplet)
       • 4nm chiplet • Feature support incl. SMC, IOMMU, AIA • Non-blocking D2D interfaces
       • Composable IO, MEM, CPU compute • Details TBD
     GEN 1 → GEN 2 → GEN 3: high-performance AI ASIC → scalability → heterogeneity and chiplets
  4. Basic structure of the Tenstorrent AI accelerator: the Tensix Core (details in Session 2)
     • 5 "Baby RISC-V" cores (32-bit RISC-V ISA)
     • 2 Network-on-Chip routers (Router 0, Router 1)
     • 1.5 MB SRAM (L1 memory)
     • Compute: Tile/Matrix Math Engine and Vector Math Engine
  5. Core Silicon Overview
     Feature                  Grayskull®                Wormhole                 Blackhole
     ----------------------   -----------------------   ----------------------   --------------------------
     Node size                12nm                      12nm                     6nm
     Max power                150W                      150W                     450W
     Tensix cores             120                       80                       140
     CPU cores                -                         -                        16 (4x 4-core SiFive x280)
     NoC data width           Dual 256-bit 2D Torus     Dual 256-bit 2D Torus    Dual 512-bit 2D Torus
     NoC streams              64                        64                       64
     Unicast                  Yes                       Yes                      Yes
     Multicast                Yes                       Yes                      Yes*
     Broadcast                Yes                       Yes                      Yes
     Memory bus               256-bit 3.7 GT/s LPDDR4   192-bit 12 GT/s GDDR6    256-bit 16 GT/s GDDR6
     Total capacity           8 GB                      12 GB                    32 GB
     SRAM per Tensix core     1 MB                      1.5 MB                   1.5 MB
     Total SRAM               120 MB                    120 MB                   210 MB
     PCI Express              Gen 4.0 x16               Gen 4.0 x16              Gen 5.0 x16
     Ethernet                 -                         16x 100 Gbps             12x 400 Gbps**
     Shared SerDes***         -                         -                        8
     Peak AICLK               1.2 GHz                   1 GHz                    1.35 GHz†
     FP8 TFLOPS               332                       292                      774†
     BLOCKFP8 TFLOPS          83                        164                      387†
     INT8 TOPS                -                         82                       194†
     FP16 TFLOPS              83                        82                       194†
     TF32 TFLOPS              -                         82                       194†
     Note: INT8 is slow on this hardware, so quantizing to INT8 is pointless; please use FP8 instead.
     The reason INT8 is slow is covered in Session 2.
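BLOCKFP8 (block floating point) is why that column differs from plain FP8: a block of values shares a single exponent, so each element only needs a small mantissa. A minimal numpy sketch of the idea; the 16-element block size and 7-bit signed mantissa here are illustrative assumptions, not the hardware's exact format:

```python
import numpy as np

def blockfp_quantize(x, block=16, mant_bits=7):
    """Quantize a 1-D array with one shared exponent per block (illustrative only)."""
    x = x.reshape(-1, block)
    # Shared exponent: scale each block so its max magnitude fits the mantissa range.
    exp = np.ceil(np.log2(np.abs(x).max(axis=1, keepdims=True) + 1e-30))
    scale = 2.0 ** (exp - (mant_bits - 1))
    mant = np.clip(np.round(x / scale), -(2 ** (mant_bits - 1)), 2 ** (mant_bits - 1) - 1)
    return mant, scale

def blockfp_dequantize(mant, scale):
    return (mant * scale).reshape(-1)

rng = np.random.default_rng(0)
x = rng.normal(size=64)
mant, scale = blockfp_quantize(x)
x_hat = blockfp_dequantize(mant, scale)
max_err = float(np.abs(x - x_hat).max())
```

Storing one exponent per block instead of one per element is what lets BLOCKFP8 keep more mantissa bits than FP8 at the same storage cost, at the price of shared dynamic range within a block.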
  6. Wormhole Products
     Wormhole n150 • n150d (TC-02002), n150s (TC-02001)
       • 72 Tensix Cores @ 1 GHz • 12 GB GDDR6 RAM @ 288 GB/s • 108 MB SRAM
       • 262 TFLOPS (FP8), 148 TFLOPS (BLOCKFP8)
       • 2x QSFP-DD 400GbE ports, 2x Warp 100 Bridge slots
       • 2.5-slot (n150d) or dual-slot (n150s), 160W TBP • approx. $1,000
     Wormhole n300 • n300d (TC-02004), n300s (TC-02003)
       • 128 Tensix Cores @ 1 GHz • 24 GB GDDR6 RAM @ 576 GB/s • 192 MB SRAM
       • 466 TFLOPS (FP8), 262 TFLOPS (BLOCKFP8)
       • 2x QSFP-DD 400GbE ports, 2x Warp 100 Bridge slots
       • 2.5-slot (n300d) or dual-slot (n300s), 300W TBP • approx. $1,500
     Most CV models, Stable Diffusion, and LLMs up to about 10B parameters run on a single n150.
     Ethernet extends the NoC across cards; models such as Llama 3.3-70B run on 4x n300.
     Shipping and maintenance costs are charged in addition to the listed prices.
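The quoted memory bandwidths follow directly from the GDDR6 figures on the Core Silicon Overview slide: bandwidth = transfer rate × bus width in bytes. A quick arithmetic check, assuming the n150 exposes one Wormhole ASIC's 192-bit 12 GT/s interface and the n300 carries two such ASICs:

```python
def mem_bandwidth_gbs(gt_per_s, bus_bits):
    """Peak DRAM bandwidth in GB/s: transfers per second times bytes per transfer."""
    return gt_per_s * bus_bits / 8

n150 = mem_bandwidth_gbs(12, 192)      # one Wormhole ASIC
n300 = 2 * mem_bandwidth_gbs(12, 192)  # two ASICs on one board
print(n150, n300)  # 288.0 576.0
```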
  7. Blackhole Products (specifications have not been finalized; coming soon)
     Blackhole p100 • Single ASIC, no scale-out
       • TC-03001 (p100a), active-cooled • Dual-slot, 300W TBP, PCIe Gen5
       • 28 GB GDDR6, 180 MB SRAM • 664 TFLOPS (FP8), 116 TFLOPS (FP16) • $999
       • Runs Llama3.3-8B
     Blackhole p150 • Single ASIC, scale-out • 4x QSFP-DD 800G ports
       • TC-03002 (p150b), passive-cooled • 64 GB DDR • 774 TFLOPS (FP8), 194 TFLOPS (FP16) • Dual-slot, 300W TBP
       • TC-03003 (p150a), active-cooled • 32 GB GDDR6, 210 MB SRAM • 774 TFLOPS (FP8), 194 TFLOPS (FP16) • Dual-slot, 300W TBP
       • TC-03004 (p150c), liquid-cooled • Details coming soon • Single-slot, 300W/450W TBP
     Blackhole p300 • Dual ASIC, scale-out • 2x Warp 400 Bridge ports
       • TC-03005 (p300b), passive-cooled • Details coming soon • Dual-slot, TBD TBP
       • TC-03006 (p300a), active-cooled • Details coming soon • Triple-slot, TBD TBP
       • TC-03007 (p300c), liquid-cooled • Details coming soon • Single-slot, TBD TBP
     "Horizon" daughterboard • 2x or 4x QSFP-DD ports • Connects to Warp 400 Bridge ports
  8. TT-LoudBox / TT-QuietBox (Wormhole)
     • With 8 Wormhole chips in Tensor Parallel mode, Llama 3.3-70B and Qwen 2.5-72B run,
       and a single box can serve up to 32 concurrent users.
     • Smaller 7B-11B-class models such as Llama and Phi run in Data Parallel mode,
       which can serve even more concurrent users.
     CPU: LB: 2x Intel® Xeon® Silver 4309Y • QB: AMD EPYC 8124P
     Memory: 512 GB DDR5-4800 • Storage: 3.8 TB • Ethernet: 2x 10GbE
     Tensix processors: 4x Wormhole n300s • 96 GB GDDR6, 768 MB SRAM • Connected in a 2x4 mesh
     TeraFLOPS: 1864 (FP8), 1048 (BLOCKFP8), 524 (FP16)
     LoudBox: 200V, rack mountable, $12,000 • QuietBox: 100V or 200V, tower desktop, $15,000
     Shipping and maintenance costs are charged in addition to the listed prices.
  9. TT-LoudBox / TT-QuietBox (Wormhole) Performance
                         TT-LoudBox                             TT-QuietBox
     CPU                 2x Intel® Xeon® Silver 4309Y           AMD EPYC 8124P
                         (8C/16T ea., up to 2.8 GHz, 105W)      (16C/32T, up to 3 GHz, 125W)
     Memory              512 GB (16x 32 GB) DDR5-4800           512 GB (8x 64 GB) DDR5-4800
     Storage             3.8 TB U.2 PCIe 4.0 x4                 4 TB M.2 NVMe PCIe 4.0 x4
     Ethernet            2x 10 GbE                              2x 10 GbE, 2x 1 GbE
     Tensix Processors   4x Wormhole n300s                      4x Wormhole n300, liquid-cooled
                         96 GB GDDR6, 768 MB SRAM, 2x4 mesh     96 GB GDDR6, 768 MB SRAM, 2x4 mesh
     Each system features an identical Tensix Processor topology.

     LLMs                                       April 2025 (t/s/u)   Target (t/s/u)   Batch size
     DeepSeek R1 Distill Llama 3.3 70B (TP=8)   15.2                 20               32
     Qwen 2.5 72B (TP=8)                        32.5                 38               32
     Falcon 7B (DP=8)                           15.5                 26               256

     CNNs                                       April 2025 (fps)     Target (fps)
     ResNet-50 (224x224) (DP=8)                 35,800               56,000

     *Performance as of 4/16/2025. DP/TP refer to parallelization: DP is "Data Parallel", TP is "Tensor Parallel".
     LoudBox: 200V, rack mountable, $12,000 • QuietBox: 100V or 200V, tower workstation, $15,000
  10. Systems Overview
      Wormhole (2024 onward):
      • TT-QuietBox • Liquid-cooled desktop workstation • 8 Wormhole ASICs, 96 GB GDDR6 • Quiet
      • TT-LoudBox (T3000) • Air-cooled 4U system • 8 Wormhole ASICs, 96 GB GDDR6 • Loud (no fan control)
      Blackhole (Q2 2025):
      • TT-QuietBox • Liquid-cooled desktop workstation • Up to 4 Blackhole ASICs, 128 GB GDDR6 • Quiet
      • TT-RackBox 4U Server • Air-cooled 4U server • Up to 8 Blackhole ASICs, 256 GB GDDR6 • Replaces TT-LoudBox; loud
      • TT-DeskBox Desktop System • Air-cooled desktop workstation • Up to 2 Blackhole ASICs, 64 GB GDDR6 • Replaces TT-LoudBox; moderate noise
  11. Tenstorrent Galaxy Wormhole Server
      • An accelerator server packing 32 chips into 6U
      • Wormhole chips are mesh-connected across nodes
      • 400-600B-class models can be operated on 2-4 nodes
      • Clustering support for LLM training is planned
      Cluster grids: 2-node 8x8 grid • 4-node 8x32 grid • 8-node 16x32 grid
      Specification:
      • Form factor: 6 RU, air-cooled
      • Accelerator: 32x Wormhole @ 250W, 8x per module tray, 6x 400 Gbps per tray
      • Host CPU: AMD EPYC 9354P, 32C/64T, 3.25-3.8 GHz
      • Host memory: 512 GB DDR5 4800 MT/s
      • Host network: 2x 100 Gbps Ethernet, 1x 1 Gbps management Ethernet
      • Host storage: 2x 960 GB NVMe M.2, 4x 4 TB or 8 TB E1.S
      • PSU: 4x 4000W 80+ Titanium, rated to ~13 kW (can throttle down)
  12. Single-Galaxy Connectivity
      (Diagram: a 4x8 grid of chips numbered 11-48, each with 4 Ethernet links;
      400/800 Gbps QSFP-DD external links and 400/800 Gbps chassis-internal links)
      Galaxy scale-out is built from:
      • Tensix processors
      • Connected in grids
      • Via high-speed Ethernet
      A single Galaxy is a "4x8 grid":
      • Ends linked to form a 2D torus
      • A Z-dimension for a 3D torus is available with Blackhole
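Linking the ends of the grid into a torus shortens the worst-case path, since each dimension can wrap around. A small sketch computing maximum hop counts on the 4x8 single-Galaxy topology; this is pure topology arithmetic for illustration, not Tenstorrent routing code:

```python
from itertools import product

def mesh_hops(a, b):
    # Hops on a plain mesh: no wrap-around links.
    return sum(abs(x - y) for x, y in zip(a, b))

def torus_hops(a, b, dims):
    # Hops on a torus: each dimension may also wrap around the ends.
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

dims = (4, 8)  # single-Galaxy grid
nodes = list(product(*(range(d) for d in dims)))
mesh_max = max(mesh_hops(a, b) for a in nodes for b in nodes)
torus_max = max(torus_hops(a, b, dims) for a in nodes for b in nodes)
print(mesh_max, torus_max)  # 10 6
```

The wrap links cut the 4x8 grid's diameter from 10 hops to 6; adding a Z-dimension (the Blackhole 3D-torus option) shortens multi-Galaxy paths the same way.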
  13. Turning Tenstorrent Technology into a Winning Product Portfolio
      Products and use cases:
      • n150s/n150d, n300s/n300d • Fully-featured PCIe cards • embedded AI development, SDK development
      • TT-QuietBox • Liquid-cooled, desk-friendly workstation • personal AI environments
      • TT-LoudBox (T3000) • Rackmount systems for multi-user environments • in-house enterprise AI systems
      • Tenstorrent Galaxy Wormhole • Ultra-dense solution for maximum-throughput workloads • cloud, HPC
  14. Software Ecosystem and Integrations
      Tenstorrent open source software stack (top to bottom):
      • TT-Forge: ML compiler fed from Jax, PyTorch, TF, and ONNX models
      • TT-NN: Python kernels and op library, used by vLLM (LLM inference), TT-Train (LLM training),
        TT-Transformer (LLM, t2s, s2t models), and manually optimized PyTorch models
      • TT-Metalium and TT-LLK (low-level kernels)
      • TT-Fabric: unified scale-up and scale-out
      Third-party developers: in-game AI, datacenter AI, visualization, AI training
      (Individual components are planned topics for sessions #2, #3, #4, and later)
      Repositories:
      • General: https://github.com/tenstorrent
      • TT-Metalium: https://github.com/tenstorrent/tt-metal
      • TT-Forge: https://github.com/tenstorrent/tt-forge
  15. (Software stack diagram repeated from the previous slide, highlighting TT-Metalium and TT-NN)
  16. TT-Metalium: Built for AI and Scale-Out
      Native multi-device kernels and ops (details featured around Session 2)
      • Per Tensix core, three kernels: data-in, compute, and data-out
      • Written in ordinary(?) C++
      • Free access to SRAM, DRAM, and other chips
      The flip side: at the lowest level, assigning tasks to Tensix cores and managing memory is manual.
      That is too painful, so libraries such as TT-NN are implemented on top of layers that automate it.
      Stack comparison vs. GPU programming:
      • TT-NN C++ host API ↔ DNN / CCL libraries (deep learning ops, collective comms ops)
      • TT-Metalium C++ host API and kernel API ↔ GPU kernel language
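The data-in / compute / data-out split above is the classic producer-consumer pipeline. A toy Python sketch of the pattern, where queues stand in for the SRAM circular buffers the real C++ kernels use; all names here are illustrative, not TT-Metalium API:

```python
import queue
import threading

def reader_kernel(data, cb_in):
    # Data-in kernel: stream input tiles into the circular buffer.
    for tile in data:
        cb_in.put(tile)
    cb_in.put(None)  # end-of-stream marker

def compute_kernel(cb_in, cb_out):
    # Compute kernel: consume tiles, do the math, push results downstream.
    while (tile := cb_in.get()) is not None:
        cb_out.put(tile * 2)  # stand-in for a matmul/eltwise op
    cb_out.put(None)

def writer_kernel(cb_out, results):
    # Data-out kernel: drain results back to DRAM (here, a plain list).
    while (tile := cb_out.get()) is not None:
        results.append(tile)

cb_in, cb_out, results = queue.Queue(maxsize=4), queue.Queue(maxsize=4), []
threads = [
    threading.Thread(target=reader_kernel, args=(range(8), cb_in)),
    threading.Thread(target=compute_kernel, args=(cb_in, cb_out)),
    threading.Thread(target=writer_kernel, args=(cb_out, results)),
]
for t in threads: t.start()
for t in threads: t.join()
print(results)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

Because the three stages run concurrently and the buffers are bounded, data movement overlaps with compute, which is the point of the three-kernel split.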
  17. TT-NN Code Example
      A comparison using the code around BERT's output layer. The TT-NN version is
      quite intuitive and readable, and is the starting point for performance optimization.

      TT-NN:

      ```python
      def bert_output(config, hidden_states, residual, *, parameters):
          output = hidden_states @ parameters.dense.weight
          output = output + parameters.dense.bias
          output = ttnn.layer_norm(
              output + residual,
              weight=parameters.LayerNorm.weight,
              bias=parameters.LayerNorm.bias,
              epsilon=config.layer_norm_eps,
          )
          return output
      ```

      PyTorch:

      ```python
      class BertOutput(nn.Module):
          def __init__(self, config):
              super().__init__()
              self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
              self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

          def forward(self, hidden_states, input_tensor):
              hidden_states = self.dense(hidden_states)
              hidden_states = self.LayerNorm(hidden_states + input_tensor)
              return hidden_states
      ```
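For readers without either framework handy, the same computation (dense projection, residual add, layer norm) fits in a few lines of plain numpy. This is an illustrative reference implementation, not Tenstorrent code; the shapes and parameter names are made up for the example:

```python
import numpy as np

def bert_output_ref(hidden_states, residual, w, b, gamma, beta, eps=1e-12):
    """Reference for BERT's output block: Linear -> add residual -> LayerNorm."""
    x = hidden_states @ w + b          # dense projection
    x = x + residual                   # residual connection
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
seq, inter, hidden = 4, 16, 8      # toy dimensions
h = rng.normal(size=(seq, inter))
r = rng.normal(size=(seq, hidden))
w, b = rng.normal(size=(inter, hidden)), rng.normal(size=hidden)
out = bert_output_ref(h, r, w, b, gamma=np.ones(hidden), beta=np.zeros(hidden))
print(out.shape)  # (4, 8)
```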
  18. Tools: TT-NN Model Bring-Up and Optimization
      • Tracer • Performance analysis (Pareto)
      • Device & host profiler, covering the 5 "Baby RISC-V" cores of each Tensix core
  19. LLM Model Support List Using TT-NN/TT-Transformer
      Japanese LLM support track record:
      • Qwen
        • abeja/ABEJA-Qwen2.5-32b-Japanese-v0.1
        • cyberagent/DeepSeek-R1-Distill-Qwen-32B-Japanese
      • Llama
        • elyza/Llama-3-ELYZA-JP-8B
        • sbintuitions/sarashina2-70b
        • tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.3
      • Mistral
        • Rakuten/RakutenAI-7B-instruct
      LLM models optimized for Tenstorrent are published at
      https://github.com/tenstorrent/tt-metal?tab=readme-ov-file#llms
  20. LLM-Related SDK: AI Chat Running Within an Hour of Unboxing
      • Tenstorrent/tt_transformers:
        • A library of the Transformer building blocks common to LLMs, implemented and optimized with TT-NN
        • The blocks built for Llama 3.1, plus a few extras, already cover Qwen2, Mistral, and Phi3
        • The interface loads models downloaded from Hugging Face as-is
      • Tenstorrent/vllm:
        • Tenstorrent's fork of vLLM
        • LLMs implemented with TT-NN/tt_transformers can be dropped in easily
        • Look & feel is essentially the same as on GPUs
      • A Docker image with TT-NN and vLLM preinstalled is available
        • Details: https://github.com/tenstorrent/tt-inference-server
      • Roughly three steps are all that is needed:
        • Install the device driver, download a model from Hugging Face, and docker run
        • Then just point your API routing at the container
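Once the container is up, pointing an application at it means sending OpenAI-compatible requests, the same as with stock vLLM. A minimal sketch that builds such a request using only the standard library; the URL, port, and model name are placeholder assumptions, not values from the slides:

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt):
    """Build an OpenAI-compatible chat completion request.

    Send it with urllib.request.urlopen(req) once the server is running.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical local endpoint and model for the tt-inference-server container:
req = build_chat_request("http://localhost:8000", "meta-llama/Llama-3.1-8B-Instruct", "Hello!")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
```

Because the endpoint follows the OpenAI schema, existing clients only need their base URL redirected to the container, which is the "just point your API routing" step above.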
  21. TT-Forge: A Graph Compiler for Getting Models Running Quickly
      (Software stack diagram repeated from slide 14, highlighting TT-Forge)
  22. When to Use TT-Forge
      • Choosing between TT-NN and TT-Forge:
        o If a TT-NN implementation of the model exists:
          → use the TT-NN implementation (currently the best-performing option)
        o If no TT-NN implementation exists:
          → implement it yourself with TT-NN (requires knowledge of Tenstorrent devices and TT-NN), or
          → compile it with TT-Forge
      • As of 2025.05, TT-Forge is an alpha version under development
        o A stable release is planned for Q3-Q4 2025
      • 205 models run end-to-end as of 2025.05
        o Vision: ResNet, DenseNet, MobileNet, ViT, HRNet, EfficientNet, Yolo, etc.
        o Text: BERT, Llama 3.2 3B, Perceiver IO, ALBERT, etc.
        o and more
  23. Tenstorrent Enables HPC
      Language/code support and topologies for every application (any HPC code, open source)
      • DragonFly: 200 BH, max hop = 3, max latency = 1.5 us (smaller 36-unit version shown)
      • 5D-HyperTorus: 1024 BH, max hop = 10, max latency = 5 us
  24. Tenstorrent Open Source Software
      • TT-Forge: MLIR-based compiler integrated into various frameworks;
        AI/ML models from domain-specific compilers to custom kernel generation
      • TT-NN: library of optimized operators
        • ATen coverage
        • PyTorch-like API
      • TT-Metalium: low-level programming model and entry point
        • Build your own kernels
        • User-facing host API
      (Slide art: audiences ranging from AI/ML + HPC and model developers down to operator and
      CUDA developers; "any AI model", "direct to metal", "library of ops", "build anything")