Dive into Triton Internals

Dive into Triton Internals https://www.kapilsharma.dev/posts/deep-dive-into-triton-internals/ Jongsu Kim

Triton? 사용하기 어려운 CUDA, 세밀한 컨트롤이 어려운 (커스텀 데이터 구조
사용문제, 스케줄링) TVM, MLIR의 중간에서 Python과 유사하여 사용하기 쉬우면서도 (@triton.jit) CUDA와 같은 고성능 GPU 코드 생성용 DSL CUDA Triton Memory Coalescing Manual Automatic Shared Memory Management Manual Automatic Scheduling (Within SMs) Manual Automatic Scheduling (Across SMs) Manual Automatic • PyTorch와 유사한 문법 • SRAM 기반 작업 • Tensor들이 GPU SRAM에 상주 • 순수 Python로 작성 후 decorator를 통해 변환 • 유연한 메모리 제어 (포인터 사용) Python Triton IR LLVM IR PTX CUDA C++ LLVM IR (NVMM) PTX SASS

How CUDA Compiler Works https://hpcgpu.mini.pw.edu.pl/cuda-compilation-toolchain/ Whole Program Compilation(Default) • 파일을
쪼개지 않고 Host 코드 먼저 처리 (Phase1) • Device(GPU) 코드를 컴파일해서 PTX와 cubin을 결합한 fatbinary (fatbin.c) 생성 (Phase 2) • fatbinary를 문자열 즉 데이터처럼 취급 • cudafe++는 CUDA structure를 standard C++처럼 취급하도 록 도와줌 • <<<>>> -> cuLaunchKernel(…) 등 • Host(CPU) code를 컴파일(Phase 2)하면서 fatbinary 주입 Separate Compilation(-dc option) • .cu를 분리하여 Host compiler(g++)과 GPU 어셈블러 (ptxas) 에 전달 • Device 코드들이 #include를 통해 host에 주입 • 독립적으로 컴파일된 오브젝트파일을 Linker를 통해 링크 Input File suffixes • .cu: CUDA 소스파일 • .ptx: PTX intermediate assembly file • Forward Compatible한 device 호환가능한 IR • .cubin:Single GPU를 위한 CUDA device binary code • .fatbin: .ptx와 .cubin의 조합 Overall Architecture

How CUDA Compiler Works https://hpcgpu.mini.pw.edu.pl/cuda-compilation-toolchain/ Host 관점 Whole Program Compilation(Default)
• 파일을 쪼개지 않고 Host 코드 먼저 처리 (Phase1) • Device(GPU) 코드를 컴파일해서 PTX와 cubin을 결합한 fatbinary (fatbin.c) 생성 (Phase 2) • fatbinary를 문자열 즉 데이터처럼 취급 • cudafe++는 CUDA structure를 standard C++처럼 취급하도 록 도와줌 • <<<>>> -> cuLaunchKernel(…) 등 • Host(CPU) code를 컴파일(Phase 2)하면서 fatbinary 주입 Separate Compilation(-dc option) • .cu를 분리하여 Host compiler(g++)과 GPU 어셈블러 (ptxas) 에 전달 • Device 코드들이 #include를 통해 host에 주입 • 독립적으로 컴파일된 오브젝트파일을 Linker를 통해 링크 Input File suffixes • .cu: CUDA 소스파일 • .ptx: PTX intermediate assembly file • Forward Compatible한 device 호환가능한 IR • .cubin:Single GPU를 위한 CUDA device binary code • .fatbin: .ptx와 .cubin의 조합

How CUDA Compiler Works https://hpcgpu.mini.pw.edu.pl/cuda-compilation-toolchain/ Device 관점 Whole Program Compilation(Default)
• 파일을 쪼개지 않고 Host 코드 먼저 처리 (Phase1) • Device(GPU) 코드를 컴파일해서 PTX와 cubin을 결합한 fatbinary (fatbin.c) 생성 (Phase 2) • fatbinary를 문자열 즉 데이터처럼 취급 • cudafe++는 CUDA structure를 standard C++처럼 취급하도 록 도와줌 • <<<>>> -> cuLaunchKernel(…) 등 • Host(CPU) code를 컴파일(Phase 2)하면서 fatbinary 주입 Separate Compilation(-dc option) • .cu를 분리하여 Host compiler(g++)과 GPU 어셈블러 (ptxas) 에 전달 • Device 코드들이 #include를 통해 host에 주입 • 독립적으로 컴파일된 오브젝트파일을 Linker를 통해 링크 Input File suffixes • .cu: CUDA 소스파일 • .ptx: PTX intermediate assembly file • Forward Compatible한 device 호환가능한 IR • .cubin:Single GPU를 위한 CUDA device binary code • .fatbin: .ptx와 .cubin의 조합

How CUDA Compiler Works Dual Compilation(Default) • 파일을 쪼개지 않고
Device(GPU) 코드를 컴파일해서 PTX와 cubin 을 결합한 fatbinary 생성 (fatbin.c): Phase 1 • fatbinary를 문자열 즉 데이터처럼 취급 • Host(CPU) code를 컴파일(Phase 2)하면서 fatbinary 주입 Separated Compilation(-dc option) • .cu를 분리하여 Host compiler(g++)과 GPU 어셈블러 (ptxas) 에 전달 • 독립적으로 컴파일된 오브젝트파일을 링커를 통해 링크 https://hpcgpu.mini.pw.edu.pl/cuda-compilation-toolchain/

CUDA (Thread) vs Triton (Tile) CUDA Programming Model (SIMT) •
Scalar Program, Blocked Threads • SIMT: 하나의 명령이 여러 스레드에서 실행 • Warp(32개 스레드)를 항상 고려 • 개별 스레드가 어떤 데이터를 처리할지 수동 관 리 https://triton-lang.org/main/programming-guide/chapter-1/introduction.html Triton Programming Model (SPMD) • Blocked Program, Scalar Threads • SPMD: 하나의 프로그램이 여러 데이터에 실행 • 하나의 커널(프로그램 인스턴스)가 처리할 타일(데이터 청크)에 만 집중 • 데이터 “타일”에만 집중. 각 스레드가 타일의 어떤 부분을 맡을지는 컴 파일러가 최적화 • Sparse Matrix를 다루는 연산과 비슷

CUDA (Thread) vs Triton (Tile) CUDA Programming Model (SIMT) https://triton-lang.org/main/programming-guide/chapter-1/introduction.html
Triton Programming Model (SPMD) #pragma parallel for(int m = 0; m < M; m++) #pragma parallel for(int n = 0; n < N; n++){ float acc = 0; for(int k = 0; k < K; k++) acc += A[m, k] * B[k, n]; C[m, n] = acc; } #pragma parallel for(int m = 0; m < M; m += MB) #pragma parallel for(int n = 0; n < N; n += NB){ float acc[MB, NB] = 0; for(int k = 0; k < K; k += KB) acc += A[m:m+MB, k:k+KB] @ B[k:k+KB, n:n+NB]; C[m:m+MB, n:n+NB] = acc; }

Triton Compiler Architecture (2019) 개발자는 Triton-C로 작성하고, 최적화를 위해 Triton-IR로
변 환, Triton-JIT을 통해 최종 머신 코드로 컴파일 https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf Triton-C • C언어와 유사한 구조 • CUDA에 익숙한 개발자나 다른 딥러닝 컴파일러 (TVM, TC)들을 위한 안정적인 frontend • Tile 변수를 통해 텐서 프로그램을 표현 • 다른 고수준 DSL에 비해 저수준 Level의 코드이지만 더 많은 유연성 제공 // 𝐶 = 𝐴𝐵𝑇 in TF, PlainML, TC and TVM C = tf.matmul(A, tf.transpose(B)) // TF C[i, j: I, J] = +(A[i, k] * B[j, k]); // PlaiML C(i, j) +=! A(i, k) * B(j, k) // TC tvm.sum(A[I, k] * B[j, k), axis=k) // TVM 다른 DSL에서의 𝐶 = 𝐴𝐵𝑇

변 환, Triton-JIT을 통해 최종 머신 코드로 컴파일 https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf Triton-C • C언어와 유사한 구조 • CUDA에 익숙한 개발자나 다른 딥러닝 컴파일러 (TVM, TC)들을 위한 안정적 인 frontend • Tile 변수를 통해 텐서 프로그램을 표현 • 다른 고수준 DSL에 비해 저수준 Level의 코드이지만 더 많은 유연성 제공 // Tile shapes are parameteric and // can be optimized by compilation backends const tunable int TM = {16, 32, 64, 128}; const tunable int TN = {16, 32, 64, 128}; const tunable int TK = {8, 16}; // C = A * B.T kernel void matmul_nt(float *a, float *b, float *c, int M, int N, int K) { // 1D tile of indicies int rm[TM] = get_global_range(0); int rn[TN] = get_global_range(1); int rk[TK] = 0 ... TK; // 2D tile of accumulators float C[TM, TN] = 0; // 2D tile of pointers float *pa[TM, TK] = a + rm[:, newaxis] * K + rk[newaxis, :]; float *pb[TN, TK] = b + rn[:, newaxis] * K + rk[newaxis, :]; for (int k = K; k > 0; k = k - TK) { bool check_k[TK] = rk < k; bool check_a[TM, TK] = (rm < M)[:, newaxis] && check_k; bool check_b[TN, TK] = (rn < N)[:, newaxis] && check_k; // Load tile operands float A[TM, TK] = @check_a ? *pa : 0; float B[TN, TK] = @check_b ? *pb : 0; // accumulate C += dot(A, trans(B)); // Update pointers pa = pa + TK; pb = pb + TK; } // Write-back accumulators float *pc[TM, TN] = c + rm[:, newaxis] * N + rn[newaxis, :]; bool check_c[TM, TN] = (rm < M)[:, newaxis] && (rn < N)[newaxis, :]; @check_c *pc = C; } • 타일 크기의 정의: TM, TN, TK로 파라미터화되어 컴파일러가 최적화 • 작업 범위 설정: get_global_range를 통해 커널이 행렬의 어느 부분을 계산할지 결정 (1D tile of indices) • 결과 타일 초기화 (2D tile of accumulators) • 포인터 준비 (2D tile of pointers) • 반복 계산 (루프) • A, B로부터 타일조각을 SRAM으로 로드 • dot(A, trans(B))를 통해 행렬 곱셈을 수행하고 C에 더함 • 포인터 이동 • 계산이 완료된 C 타일을 다시 메모리에 write-back Triton에서의 𝐶 = 𝐴𝐵𝑇

변 환, Triton-JIT을 통해 최종 머신 코드로 컴파일 https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf Triton-IR • LLVM을 기반으로 하는 IR이며 Triton-C를 직접 Parsing하여 구축 • LLVM-IR과 고수준 레벨의 structure를 공유하지만 tile-level data-flow 와 control-flow 분석을 위한 extension을 제공 • Tile-level에서 프로그램을 분석, 변환, 최적화하기 적합한 환경 제공 Triton-JIT • JIT(Just-in-Time) 컴파일러이자 백엔드 • Triton-IR을 LLVM Bytecode로 컴파일 • 내부 최적화 기법 적용 • Auto-Tuner • 메타 파라미터(타일 크기 등)을 최적화 하기 위해 벤치마크 루프 수행 • Machine Independent Passes • Prefetching • Tile-level Peephole Optimization • Machine Dependent Passes • Hierarchical Tiling • Memory Coalescing • Shared Memory Allocation • Shared Memory Synchronization

Triton Compiler Architecture (2022~) 2022년 기존의 Triton Compiler Infrastructure를 MLIR
기반으로 재구축하여 Frontend와 Backend 사이의 Middle Layer 역할을 수행 https://www.youtube.com/watch?v=y2V3ucS1pfQ Python DSL Triton Dialects (Triton IR) LLVM Dialects PTX SASS Triton-C Triton-IR Triton-JIT Machine Code TritonGPU Dialects TritonGPU Dialects LLVM Dialects AMD GPU isa Linalg Dialects 기존 아키텍처 MLIR 기반 아키텍처 Triton(MLIR Compiler) Backend(LLVM) LLVM Dialects HW Vendor Dialects HW Frontend(Python)

왜 Middle Layer인가? Language Agnostic, HW Agnostic Layer에 대한 필요성이
존재 https://www.youtube.com/watch?v=y2V3ucS1pfQ Language Agnostic • Use existing MLIR backends • Compatible to other languages / graphs Hardware Agnostic • Share representation across hardware targets • Common analysis, optimizations and transformations 왜 linalg Dialect? • linalg의 value semantics가 Triton의 BLOCK SIMD-style code와 잘 맞음 • linalg dialects는 iteration space가 아닌 data space를 tiling하여 고수준 semantics를 유지 • Affine dialects는 루프기반이라 control flow가 복잡해지고 주소 계산식을 복잡하게 만듦 • Transformation와 Optimization을 더욱 쉽게 만들어줌

MLIR based Triton Architecture Frontend / Optimizer / Backend 세
부분으로 나뉨 https://superjomn.github.io/posts/triton-mlir-publish/ Frontend • Python kernel code → Triton Dialect(Triton IR) • Kernel launch를 위한 runtime 관리 Optimizer • Triton Dialect(Triton IR) → TritonGPU Dialect (TritonGPU IR) • Multiple pass를 이용한 최적화 Backend • → TritonGPU Dialect (TritonGPU IR) → LLVM IR Triton Dialect(Triton IR) • 계산로직을 표현하는 Hardware Agnostic Representation TritonGPU Dialect (TritonGPU IR) • GPU관련 계산 표현 MLIR에서 재사용하는 dialects • std dialect: tensor, int, float 등 데이터 타입 • artih dialect: 각종 수학연산 • scf dialect: if, for 등의 제어흐름 • nvvm dialect: thread_id 등 을 얻는 연산 • gpu dialect: printf 등의 연산

Optimizer Passes Triton IR 최적화 / TritonIR → TritonGPU IR
변환 / TritonGPU IR 최적화 세 가지 패스로 나뉨 https://superjomn.github.io/posts/triton-mlir-publish/ Triton IR 최적화 (MLIR general optimization) (계산 자체의 최적화) • Inliner: kernel call을 inlining • Combine: 특정 패턴 rewrite • Canonicalizer: 각종 단순화 패턴 write • CSE: MLIR의 공통 부분식 제거 • LICM: MLIR의 LoopInvariantCodeMotion Pass • 루프 불변 변수를 for-loop 밖으로 이동 TritonGPU IR 최적화 (GPU 하드웨어 최적화 추가) • ConvertTritonToTritonGPU: 변환 Pass 주로 TritonGPU 고유 Layout 추 가 • Coalesce: Order를 재배열해 최대 continguity 축이 앞으로 오도록 함 • Pipeline: MMA 명령에 대응하는 global memory→shared memory의 N-Buffer 최적화 • Prefetch: MMA 명령에 대응하는 shared memory→register file의 N- Buffer 최적화

Reference • https://www.kapilsharma.dev/posts/deep-dive-into-triton-internals/ • https://hpcgpu.mini.pw.edu.pl/cuda-compilation-toolchain/ • https://modal.com/gpu-glossary • https://superjomn.github.io/posts/triton-mlir-publish/ •
번역 : https://rosettalens.com/s/ko/triton-mlir-publish • https://education.molssi.org/gpu_programming_beginner/04-gpu-compilation-model.html

Dive into Triton Internals

Dive into Triton Internals

Liam Jongsu Kim

More Decks by Liam Jongsu Kim

Other Decks in Programming

Featured

Transcript

Dive into Triton Internals https://www.kapilsharma.dev/posts/deep-dive-into-triton-internals/ Jongsu Kim

Triton? 사용하기 어려운 CUDA, 세밀한 컨트롤이 어려운 (커스텀 데이터 구조

How CUDA Compiler Works https://hpcgpu.mini.pw.edu.pl/cuda-compilation-toolchain/ Whole Program Compilation(Default) • 파일을

How CUDA Compiler Works https://hpcgpu.mini.pw.edu.pl/cuda-compilation-toolchain/ Host 관점 Whole Program Compilation(Default)

How CUDA Compiler Works https://hpcgpu.mini.pw.edu.pl/cuda-compilation-toolchain/ Device 관점 Whole Program Compilation(Default)

How CUDA Compiler Works Dual Compilation(Default) • 파일을 쪼개지 않고

CUDA (Thread) vs Triton (Tile) CUDA Programming Model (SIMT) •

CUDA (Thread) vs Triton (Tile) CUDA Programming Model (SIMT) https://triton-lang.org/main/programming-guide/chapter-1/introduction.html

Triton Compiler Architecture (2019) 개발자는 Triton-C로 작성하고, 최적화를 위해 Triton-IR로

Triton Compiler Architecture (2019) 개발자는 Triton-C로 작성하고, 최적화를 위해 Triton-IR로

Triton Compiler Architecture (2019) 개발자는 Triton-C로 작성하고, 최적화를 위해 Triton-IR로

Triton Compiler Architecture (2022~) 2022년 기존의 Triton Compiler Infrastructure를 MLIR

왜 Middle Layer인가? Language Agnostic, HW Agnostic Layer에 대한 필요성이

MLIR based Triton Architecture Frontend / Optimizer / Backend 세

Optimizer Passes Triton IR 최적화 / TritonIR → TritonGPU IR

Reference • https://www.kapilsharma.dev/posts/deep-dive-into-triton-internals/ • https://hpcgpu.mini.pw.edu.pl/cuda-compilation-toolchain/ • https://modal.com/gpu-glossary • https://superjomn.github.io/posts/triton-mlir-publish/ •