Tenstorrent HW/SW Overview

Session 1 of Tenstorrent Tech Talk #1


Tenstorrent Japan

June 13, 2025

Transcript

  1. Tenstorrent TechTalk #1 • May 2025 • Tenstorrent Japan KK
     Tenstorrent HW/SW Overview • FAE Manager 伊藤康宏 • June 2025 • Tenstorrent Japan KK
  2. Tenstorrent AI Strategy
     • Tenstorrent's vision for AI
       • AI Everywhere: affordable, high-performance products that anyone can use.
       • mW to MW: the same architecture and software scale from small devices to large systems.
     • Open strategy
       • CPUs: x86 (the industry standard) and RISC-V (an open ISA)
       • Open-source SDK, developed in collaboration with the developer community
       • Actively promoting standardization of chiplets as well
     • Available in a variety of form factors
       • IP, chip, chiplet, PCIe card, server, data center
     • To accelerate AI adoption and DX across Japanese industry and companies,
       contribute to developing the engineers who drive AI and DX
     • Growth strategy
       • 2025: grow the developer base; expand sales and support together with distributors
       • 2026: full-scale entry into the data center business
     (Slide art: "Any AI model", "Open source", "Optimized ML results", "Custom ops", "Build anything", compiler diagram)
  3. Core Silicon Roadmap
     • Grayskull® (GEN 1, AI Processor) • 2021 tapeout, 2023 product, 2025 EOL
       • 120 Tensix Cores • 12nm • 332 TFLOPS (FP8) • 83 TFLOPS (BLOCKFP8)
       • 16 lanes of PCIe Gen 4.0 • 8 channels LPDDR4
     • Wormhole (GEN 2, Networked AI Processor) • 2022 tapeout, 2024 product • Now available
       • 80 Tensix+ Cores • 12nm • 292 TFLOPS (FP8) • 164 TFLOPS (BLOCKFP8)
       • 16 lanes of PCIe Gen 4.0 • 16x 100 Gbps Ethernet • 6 channels GDDR6
     • Blackhole (Standalone AI Computer) • 2023 tapeout, 2025 product • Now available
       • 140 Tensix++ Cores • 6nm • 774 TFLOPS (FP8) • 387 TFLOPS (BLOCKFP8)
       • 12x 400 Gbps Ethernet • 48 lanes of SerDes • 8 channels of GDDR6
       • 16 "Big RISC-V" CPU cores
     • Quasar (GEN 3, Low Power AI Chiplet) • 2025 tapeout, 2026 product
       • 32 Tensix NEO Cores • 4nm chiplet • Features incl. SMC with self-boot/reset
       • Non-blocking D2D interfaces • Easily stack Quasar or combine to choose your own compute
     • Athena (High Performance RISC-V CPU Chiplet)
       • 4nm chiplet • Feature support incl. SMC, IOMMU, AIA • Non-blocking D2D interfaces
       • Composable IO, MEM, CPU compute • Details TBD
     GEN 1 → GEN 2 → GEN 3: high-performance AI ASIC → scalability → heterogeneity and chiplets
  4. Basic structure of the Tenstorrent AI accelerator: the Tensix Core (details in Session 2)
     • 5 "Baby RISC-V" cores (32-bit RISC-V ISA)
     • 2 Network-on-Chip routers (Router 0, Router 1)
     • 1.5 MB SRAM (L1 memory)
     • Compute: Tile/Matrix Math Engine and Vector Math Engine
  5. Core Silicon Overview
     Feature                  Grayskull®                Wormhole                 Blackhole
     ----------------------   -----------------------   ----------------------   --------------------------
     Node size                12nm                      12nm                     6nm
     Max power                150W                      150W                     450W
     Tensix cores             120                       80                       140
     CPU cores                -                         -                        16 (4x 4-core SiFive x280)
     NoC data width           Dual 256-bit 2D Torus     Dual 256-bit 2D Torus    Dual 512-bit 2D Torus
     NoC streams              64                        64                       64
     Unicast                  Yes                       Yes                      Yes
     Multicast                Yes                       Yes                      Yes*
     Broadcast                Yes                       Yes                      Yes
     Memory bus               256-bit 3.7 GT/s LPDDR4   192-bit 12 GT/s GDDR6    256-bit 16 GT/s GDDR6
     Total capacity           8 GB                      12 GB                    32 GB
     SRAM per Tensix core     1 MB                      1.5 MB                   1.5 MB
     Total SRAM               120 MB                    120 MB                   210 MB
     PCI Express              Gen 4.0 x16               Gen 4.0 x16              Gen 5.0 x16
     Ethernet                 -                         16x 100 Gbps             12x 400 Gbps**
     Shared SerDes***         -                         -                        8
     Peak AICLK               1.2 GHz                   1 GHz                    1.35 GHz†
     FP8 TFLOPS               332                       292                      774†
     BLOCKFP8 TFLOPS          83                        164                      387†
     INT8 TOPS                -                         82                       194†
     FP16 TFLOPS              83                        82                       194†
     TF32 TFLOPS              -                         82                       194†
     Note: INT8 is slow on this hardware, so quantizing to INT8 is pointless; please use FP8 instead.
     The reason INT8 is slow is covered in Session 2.
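BLOCKFP8 (block floating point) is why that column differs from plain FP8: a block of values shares a single exponent, so each element only needs a small mantissa. A minimal numpy sketch of the idea; the 16-element block size and 7-bit signed mantissa here are illustrative assumptions, not the hardware's exact format:

```python
import numpy as np

def blockfp_quantize(x, block=16, mant_bits=7):
    """Quantize a 1-D array with one shared exponent per block (illustrative only)."""
    x = x.reshape(-1, block)
    # Shared exponent: scale each block so its max magnitude fits the mantissa range.
    exp = np.ceil(np.log2(np.abs(x).max(axis=1, keepdims=True) + 1e-30))
    scale = 2.0 ** (exp - (mant_bits - 1))
    mant = np.clip(np.round(x / scale), -(2 ** (mant_bits - 1)), 2 ** (mant_bits - 1) - 1)
    return mant, scale

def blockfp_dequantize(mant, scale):
    return (mant * scale).reshape(-1)

rng = np.random.default_rng(0)
x = rng.normal(size=64)
mant, scale = blockfp_quantize(x)
x_hat = blockfp_dequantize(mant, scale)
max_err = float(np.abs(x - x_hat).max())
```

Storing one exponent per block instead of one per element is what lets BLOCKFP8 keep more mantissa bits than FP8 at the same storage cost, at the price of shared dynamic range within a block.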
  6. Wormhole Products
     Wormhole n150 • n150d (TC-02002), n150s (TC-02001)
       • 72 Tensix Cores @ 1 GHz • 12 GB GDDR6 RAM @ 288 GB/s • 108 MB SRAM
       • 262 TFLOPS (FP8), 148 TFLOPS (BLOCKFP8)
       • 2x QSFP-DD 400GbE ports, 2x Warp 100 Bridge slots
       • 2.5-slot (n150d) or dual-slot (n150s), 160W TBP • approx. $1,000
     Wormhole n300 • n300d (TC-02004), n300s (TC-02003)
       • 128 Tensix Cores @ 1 GHz • 24 GB GDDR6 RAM @ 576 GB/s • 192 MB SRAM
       • 466 TFLOPS (FP8), 262 TFLOPS (BLOCKFP8)
       • 2x QSFP-DD 400GbE ports, 2x Warp 100 Bridge slots
       • 2.5-slot (n300d) or dual-slot (n300s), 300W TBP • approx. $1,500
     Most CV models, Stable Diffusion, and LLMs up to about 10B parameters run on a single n150.
     Ethernet extends the NoC across cards; models such as Llama 3.3-70B run on 4x n300.
     Shipping and maintenance costs are charged in addition to the listed prices.
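The quoted memory bandwidths follow directly from the GDDR6 figures on the Core Silicon Overview slide: bandwidth = transfer rate × bus width in bytes. A quick arithmetic check, assuming the n150 exposes one Wormhole ASIC's 192-bit 12 GT/s interface and the n300 carries two such ASICs:

```python
def mem_bandwidth_gbs(gt_per_s, bus_bits):
    """Peak DRAM bandwidth in GB/s: transfers per second times bytes per transfer."""
    return gt_per_s * bus_bits / 8

n150 = mem_bandwidth_gbs(12, 192)      # one Wormhole ASIC
n300 = 2 * mem_bandwidth_gbs(12, 192)  # two ASICs on one board
print(n150, n300)  # 288.0 576.0
```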
  7. Blackhole Products (specifications have not been finalized; coming soon)
     Blackhole p100 • Single ASIC, no scale-out
       • TC-03001 (p100a), active-cooled • Dual-slot, 300W TBP, PCIe Gen5
       • 28 GB GDDR6, 180 MB SRAM • 664 TFLOPS (FP8), 116 TFLOPS (FP16) • $999
       • Runs Llama3.3-8B
     Blackhole p150 • Single ASIC, scale-out • 4x QSFP-DD 800G ports
       • TC-03002 (p150b), passive-cooled • 64 GB DDR • 774 TFLOPS (FP8), 194 TFLOPS (FP16) • Dual-slot, 300W TBP
       • TC-03003 (p150a), active-cooled • 32 GB GDDR6, 210 MB SRAM • 774 TFLOPS (FP8), 194 TFLOPS (FP16) • Dual-slot, 300W TBP
       • TC-03004 (p150c), liquid-cooled • Details coming soon • Single-slot, 300W/450W TBP
     Blackhole p300 • Dual ASIC, scale-out • 2x Warp 400 Bridge ports
       • TC-03005 (p300b), passive-cooled • Details coming soon • Dual-slot, TBD TBP
       • TC-03006 (p300a), active-cooled • Details coming soon • Triple-slot, TBD TBP
       • TC-03007 (p300c), liquid-cooled • Details coming soon • Single-slot, TBD TBP
     "Horizon" daughterboard • 2x or 4x QSFP-DD ports • Connects to Warp 400 Bridge ports
  8. TT-LoudBox / TT-QuietBox (Wormhole)
     • With 8 Wormhole chips in Tensor Parallel mode, Llama 3.3-70B and Qwen 2.5-72B run,
       and a single box can serve up to 32 concurrent users.
     • Smaller 7B-11B-class models such as Llama and Phi run in Data Parallel mode,
       which can serve even more concurrent users.
     CPU: LB: 2x Intel® Xeon® Silver 4309Y • QB: AMD EPYC 8124P
     Memory: 512 GB DDR5-4800 • Storage: 3.8 TB • Ethernet: 2x 10GbE
     Tensix processors: 4x Wormhole n300s • 96 GB GDDR6, 768 MB SRAM • Connected in a 2x4 mesh
     TeraFLOPS: 1864 (FP8), 1048 (BLOCKFP8), 524 (FP16)
     LoudBox: 200V, rack mountable, $12,000 • QuietBox: 100V or 200V, tower desktop, $15,000
     Shipping and maintenance costs are charged in addition to the listed prices.
  9. TT-LoudBox / TT-QuietBox (Wormhole) Performance
                         TT-LoudBox                             TT-QuietBox
     CPU                 2x Intel® Xeon® Silver 4309Y           AMD EPYC 8124P
                         (8C/16T ea., up to 2.8 GHz, 105W)      (16C/32T, up to 3 GHz, 125W)
     Memory              512 GB (16x 32 GB) DDR5-4800           512 GB (8x 64 GB) DDR5-4800
     Storage             3.8 TB U.2 PCIe 4.0 x4                 4 TB M.2 NVMe PCIe 4.0 x4
     Ethernet            2x 10 GbE                              2x 10 GbE, 2x 1 GbE
     Tensix Processors   4x Wormhole n300s                      4x Wormhole n300, liquid-cooled
                         96 GB GDDR6, 768 MB SRAM, 2x4 mesh     96 GB GDDR6, 768 MB SRAM, 2x4 mesh
     Each system features an identical Tensix Processor topology.

     LLMs                                       April 2025 (t/s/u)   Target (t/s/u)   Batch size
     DeepSeek R1 Distill Llama 3.3 70B (TP=8)   15.2                 20               32
     Qwen 2.5 72B (TP=8)                        32.5                 38               32
     Falcon 7B (DP=8)                           15.5                 26               256

     CNNs                                       April 2025 (fps)     Target (fps)
     ResNet-50 (224x224) (DP=8)                 35,800               56,000

     *Performance as of 4/16/2025. DP/TP refer to parallelization: DP is "Data Parallel", TP is "Tensor Parallel".
     LoudBox: 200V, rack mountable, $12,000 • QuietBox: 100V or 200V, tower workstation, $15,000
  10. Systems Overview
      Wormhole (2024 onward):
      • TT-QuietBox • Liquid-cooled desktop workstation • 8 Wormhole ASICs, 96 GB GDDR6 • Quiet
      • TT-LoudBox (T3000) • Air-cooled 4U system • 8 Wormhole ASICs, 96 GB GDDR6 • Loud (no fan control)
      Blackhole (Q2 2025):
      • TT-QuietBox • Liquid-cooled desktop workstation • Up to 4 Blackhole ASICs, 128 GB GDDR6 • Quiet
      • TT-RackBox 4U Server • Air-cooled 4U server • Up to 8 Blackhole ASICs, 256 GB GDDR6 • Replaces TT-LoudBox; loud
      • TT-DeskBox Desktop System • Air-cooled desktop workstation • Up to 2 Blackhole ASICs, 64 GB GDDR6 • Replaces TT-LoudBox; moderate noise
  11. Tenstorrent Galaxy Wormhole Server
      • An accelerator server packing 32 chips into 6U
      • Wormhole chips are mesh-connected across nodes
      • 400-600B-class models can be operated on 2-4 nodes
      • Clustering support for LLM training is planned
      Cluster grids: 2-node 8x8 grid • 4-node 8x32 grid • 8-node 16x32 grid
      Specification:
      • Form factor: 6 RU, air-cooled
      • Accelerator: 32x Wormhole @ 250W, 8x per module tray, 6x 400 Gbps per tray
      • Host CPU: AMD EPYC 9354P, 32C/64T, 3.25-3.8 GHz
      • Host memory: 512 GB DDR5 4800 MT/s
      • Host network: 2x 100 Gbps Ethernet, 1x 1 Gbps management Ethernet
      • Host storage: 2x 960 GB NVMe M.2, 4x 4 TB or 8 TB E1.S
      • PSU: 4x 4000W 80+ Titanium, rated to ~13 kW (can throttle down)
  12. Single-Galaxy Connectivity
      (Diagram: a 4x8 grid of chips numbered 11-48, each with 4 Ethernet links;
      400/800 Gbps QSFP-DD external links and 400/800 Gbps chassis-internal links)
      Galaxy scale-out is built from:
      • Tensix processors
      • Connected in grids
      • Via high-speed Ethernet
      A single Galaxy is a "4x8 grid":
      • Ends linked to form a 2D torus
      • A Z-dimension for a 3D torus is available with Blackhole
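Linking the ends of the grid into a torus shortens the worst-case path, since each dimension can wrap around. A small sketch computing maximum hop counts on the 4x8 single-Galaxy topology; this is pure topology arithmetic for illustration, not Tenstorrent routing code:

```python
from itertools import product

def mesh_hops(a, b):
    # Hops on a plain mesh: no wrap-around links.
    return sum(abs(x - y) for x, y in zip(a, b))

def torus_hops(a, b, dims):
    # Hops on a torus: each dimension may also wrap around the ends.
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

dims = (4, 8)  # single-Galaxy grid
nodes = list(product(*(range(d) for d in dims)))
mesh_max = max(mesh_hops(a, b) for a in nodes for b in nodes)
torus_max = max(torus_hops(a, b, dims) for a in nodes for b in nodes)
print(mesh_max, torus_max)  # 10 6
```

The wrap links cut the 4x8 grid's diameter from 10 hops to 6; adding a Z-dimension (the Blackhole 3D-torus option) shortens multi-Galaxy paths the same way.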
  13. Turning Tenstorrent Technology into a Winning Product Portfolio
      Products and use cases:
      • n150s/n150d, n300s/n300d • Fully-featured PCIe cards • embedded AI development, SDK development
      • TT-QuietBox • Liquid-cooled, desk-friendly workstation • personal AI environments
      • TT-LoudBox (T3000) • Rackmount systems for multi-user environments • in-house enterprise AI systems
      • Tenstorrent Galaxy Wormhole • Ultra-dense solution for maximum-throughput workloads • cloud, HPC
  14. Software Ecosystem and Integrations
      Tenstorrent open source software stack (top to bottom):
      • TT-Forge: ML compiler fed from Jax, PyTorch, TF, and ONNX models
      • TT-NN: Python kernels and op library, used by vLLM (LLM inference), TT-Train (LLM training),
        TT-Transformer (LLM, t2s, s2t models), and manually optimized PyTorch models
      • TT-Metalium and TT-LLK (low-level kernels)
      • TT-Fabric: unified scale-up and scale-out
      Third-party developers: in-game AI, datacenter AI, visualization, AI training
      (Individual components are planned topics for sessions #2, #3, #4, and later)
      Repositories:
      • General: https://github.com/tenstorrent
      • TT-Metalium: https://github.com/tenstorrent/tt-metal
      • TT-Forge: https://github.com/tenstorrent/tt-forge
  15. (Software stack diagram repeated from the previous slide, highlighting TT-Metalium and TT-NN)
  16. TT-Metalium: Built for AI and Scale-Out
      Native multi-device kernels and ops (details featured around Session 2)
      • Per Tensix core, three kernels: data-in, compute, and data-out
      • Written in ordinary(?) C++
      • Free access to SRAM, DRAM, and other chips
      The flip side: at the lowest level, assigning tasks to Tensix cores and managing memory is manual.
      That is too painful, so libraries such as TT-NN are implemented on top of layers that automate it.
      Stack comparison vs. GPU programming:
      • TT-NN C++ host API ↔ DNN / CCL libraries (deep learning ops, collective comms ops)
      • TT-Metalium C++ host API and kernel API ↔ GPU kernel language
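The data-in / compute / data-out split above is the classic producer-consumer pipeline. A toy Python sketch of the pattern, where queues stand in for the SRAM circular buffers the real C++ kernels use; all names here are illustrative, not TT-Metalium API:

```python
import queue
import threading

def reader_kernel(data, cb_in):
    # Data-in kernel: stream input tiles into the circular buffer.
    for tile in data:
        cb_in.put(tile)
    cb_in.put(None)  # end-of-stream marker

def compute_kernel(cb_in, cb_out):
    # Compute kernel: consume tiles, do the math, push results downstream.
    while (tile := cb_in.get()) is not None:
        cb_out.put(tile * 2)  # stand-in for a matmul/eltwise op
    cb_out.put(None)

def writer_kernel(cb_out, results):
    # Data-out kernel: drain results back to DRAM (here, a plain list).
    while (tile := cb_out.get()) is not None:
        results.append(tile)

cb_in, cb_out, results = queue.Queue(maxsize=4), queue.Queue(maxsize=4), []
threads = [
    threading.Thread(target=reader_kernel, args=(range(8), cb_in)),
    threading.Thread(target=compute_kernel, args=(cb_in, cb_out)),
    threading.Thread(target=writer_kernel, args=(cb_out, results)),
]
for t in threads: t.start()
for t in threads: t.join()
print(results)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

Because the three stages run concurrently and the buffers are bounded, data movement overlaps with compute, which is the point of the three-kernel split.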
  17. TT-NN Code Example
      A comparison using the code around BERT's output layer. The TT-NN version is
      quite intuitive and readable, and is the starting point for performance optimization.

      TT-NN:

      ```python
      def bert_output(config, hidden_states, residual, *, parameters):
          output = hidden_states @ parameters.dense.weight
          output = output + parameters.dense.bias
          output = ttnn.layer_norm(
              output + residual,
              weight=parameters.LayerNorm.weight,
              bias=parameters.LayerNorm.bias,
              epsilon=config.layer_norm_eps,
          )
          return output
      ```

      PyTorch:

      ```python
      class BertOutput(nn.Module):
          def __init__(self, config):
              super().__init__()
              self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
              self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

          def forward(self, hidden_states, input_tensor):
              hidden_states = self.dense(hidden_states)
              hidden_states = self.LayerNorm(hidden_states + input_tensor)
              return hidden_states
      ```
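For readers without either framework handy, the same computation (dense projection, residual add, layer norm) fits in a few lines of plain numpy. This is an illustrative reference implementation, not Tenstorrent code; the shapes and parameter names are made up for the example:

```python
import numpy as np

def bert_output_ref(hidden_states, residual, w, b, gamma, beta, eps=1e-12):
    """Reference for BERT's output block: Linear -> add residual -> LayerNorm."""
    x = hidden_states @ w + b          # dense projection
    x = x + residual                   # residual connection
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
seq, inter, hidden = 4, 16, 8      # toy dimensions
h = rng.normal(size=(seq, inter))
r = rng.normal(size=(seq, hidden))
w, b = rng.normal(size=(inter, hidden)), rng.normal(size=hidden)
out = bert_output_ref(h, r, w, b, gamma=np.ones(hidden), beta=np.zeros(hidden))
print(out.shape)  # (4, 8)
```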
  18. Tools: TT-NN Model Bring-Up and Optimization
      • Tracer • Performance analysis (Pareto)
      • Device & host profiler, covering the 5 "Baby RISC-V" cores of each Tensix core
  19. LLM Model Support List Using TT-NN/TT-Transformer
      Japanese LLM support track record:
      • Qwen
        • abeja/ABEJA-Qwen2.5-32b-Japanese-v0.1
        • cyberagent/DeepSeek-R1-Distill-Qwen-32B-Japanese
      • Llama
        • elyza/Llama-3-ELYZA-JP-8B
        • sbintuitions/sarashina2-70b
        • tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.3
      • Mistral
        • Rakuten/RakutenAI-7B-instruct
      LLM models optimized for Tenstorrent are published at
      https://github.com/tenstorrent/tt-metal?tab=readme-ov-file#llms
  20. LLM-Related SDK: AI Chat Running Within an Hour of Unboxing
      • Tenstorrent/tt_transformers:
        • A library of the Transformer building blocks common to LLMs, implemented and optimized with TT-NN
        • The blocks built for Llama 3.1, plus a few extras, already cover Qwen2, Mistral, and Phi3
        • The interface loads models downloaded from Hugging Face as-is
      • Tenstorrent/vllm:
        • Tenstorrent's fork of vLLM
        • LLMs implemented with TT-NN/tt_transformers can be dropped in easily
        • Look & feel is essentially the same as on GPUs
      • A Docker image with TT-NN and vLLM preinstalled is available
        • Details: https://github.com/tenstorrent/tt-inference-server
      • Roughly three steps are all that is needed:
        • Install the device driver, download a model from Hugging Face, and docker run
        • Then just point your API routing at the container
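Once the container is up, pointing an application at it means sending OpenAI-compatible requests, the same as with stock vLLM. A minimal sketch that builds such a request using only the standard library; the URL, port, and model name are placeholder assumptions, not values from the slides:

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt):
    """Build an OpenAI-compatible chat completion request.

    Send it with urllib.request.urlopen(req) once the server is running.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical local endpoint and model for the tt-inference-server container:
req = build_chat_request("http://localhost:8000", "meta-llama/Llama-3.1-8B-Instruct", "Hello!")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
```

Because the endpoint follows the OpenAI schema, existing clients only need their base URL redirected to the container, which is the "just point your API routing" step above.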
  21. TT-Forge: A Graph Compiler for Getting Models Running Quickly
      (Software stack diagram repeated from slide 14, highlighting TT-Forge)
  22. When to Use TT-Forge
      • Choosing between TT-NN and TT-Forge:
        o If a TT-NN implementation of the model exists:
          → use the TT-NN implementation (currently the best-performing option)
        o If no TT-NN implementation exists:
          → implement it yourself with TT-NN (requires knowledge of Tenstorrent devices and TT-NN), or
          → compile it with TT-Forge
      • As of 2025.05, TT-Forge is an alpha version under development
        o A stable release is planned for Q3-Q4 2025
      • 205 models run end-to-end as of 2025.05
        o Vision: ResNet, DenseNet, MobileNet, ViT, HRNet, EfficientNet, Yolo, etc.
        o Text: BERT, Llama 3.2 3B, Perceiver IO, ALBERT, etc.
        o and more
  23. Tenstorrent Enables HPC
      Language/code support and topologies for every application (any HPC code, open source)
      • DragonFly: 200 BH, max hop = 3, max latency = 1.5 us (smaller 36-unit version shown)
      • 5D-HyperTorus: 1024 BH, max hop = 10, max latency = 5 us
  24. Tenstorrent Open Source Software
      • TT-Forge: MLIR-based compiler integrated into various frameworks;
        AI/ML models from domain-specific compilers to custom kernel generation
      • TT-NN: library of optimized operators
        • ATen coverage
        • PyTorch-like API
      • TT-Metalium: low-level programming model and entry point
        • Build your own kernels
        • User-facing host API
      (Slide art: audiences ranging from AI/ML + HPC and model developers down to operator and
      CUDA developers; "any AI model", "direct to metal", "library of ops", "build anything")