As model size increases, tensor sizes grow as well → activation outliers emerge [Dettmers 2022]. One solution is to divide the tensor into smaller blocks and apply a scaling factor to each block, rather than a single scaling factor to the entire tensor [MS, AMD, Intel, Meta, NVIDIA, Qualcomm, 2023]. NVIDIA Blackwell supports block-wise FP8 GEMM, called MXFP8 (MicroScaling FP8), in hardware; Hopper has no such capability. Moreover, the claim that MXFP8 does not degrade LLM training quality has so far been verified only for the GEMMs in Linear layers [GTC25].

[Figure: Activation values plot (8B model)]
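To make the block-wise scaling idea concrete, below is a minimal NumPy sketch of per-block scaling in the spirit of MXFP8. The block size of 32 and the E4M3 maximum of 448 follow the OCP Microscaling spec, but the function names are hypothetical and the cast is simulated (no mantissa rounding, and the scales here are not restricted to power-of-two E8M0 values as real MXFP8 requires), so this only illustrates the bookkeeping, not the hardware behavior.

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite value representable in FP8 E4M3
BLOCK_SIZE = 32        # MX block size per the OCP Microscaling spec

def blockwise_fp8_quantize(x: np.ndarray):
    """Scale a 1-D tensor block by block with one shared scale per block.

    A per-block scale keeps a single activation outlier from shrinking
    the dynamic range of the entire tensor. The FP8 cast itself is only
    simulated here (scale + clip, no mantissa rounding).
    """
    x = x.astype(np.float32)
    pad = (-len(x)) % BLOCK_SIZE
    blocks = np.pad(x, (0, pad)).reshape(-1, BLOCK_SIZE)

    # One scale per block: map the block's max magnitude onto the FP8 range.
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scales = np.where(amax > 0, amax / FP8_E4M3_MAX, 1.0)

    # Simulated FP8 cast: scale, then clip to the representable range.
    q = np.clip(blocks / scales, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scales

def blockwise_fp8_dequantize(q, scales, orig_len):
    """Undo the per-block scaling and strip the padding."""
    return (q * scales).reshape(-1)[:orig_len]

# Example: only the block containing the outlier gets a large scale,
# so the remaining blocks keep fine-grained resolution.
x = np.random.randn(128).astype(np.float32)
x[5] = 200.0  # activation outlier
q, s = blockwise_fp8_quantize(x)
print("per-block scales:", s.ravel())
```

With a single tensor-wide scale, the outlier at index 5 would force every value onto a coarse grid; with 32-element blocks, only its own block pays that cost.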