GPUs
• modeling the bidimensional nature of a spectrogram
• add a distance penalty to the attention to bias it towards local context
• capture local 2D-invariant features
• model context

2. End-to-end outperforms cascading with Transformer

SAN - Self-Attention Network
MHA - Multi-Head Attention
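
The distance penalty on attention mentioned above can be illustrated with a minimal sketch. The slides do not give the penalty's form, so this assumes a logarithmic penalty, log(1 + |i - j|), subtracted from the attention logits before the softmax. The function name `distance_penalized_weights` and the `penalty_scale` parameter are hypothetical and used only for illustration.

```python
import math

def distance_penalized_weights(scores, penalty_scale=1.0):
    """Row-wise softmax over attention logits minus a log distance penalty.

    scores[i][j] is the raw (already scaled) logit between positions i and j.
    The penalty form log(1 + |i - j|) is an assumption, not taken from the slides.
    """
    weights = []
    for i, row in enumerate(scores):
        # Far-away positions get a larger penalty; the diagonal gets none.
        penalized = [s - penalty_scale * math.log1p(abs(i - j))
                     for j, s in enumerate(row)]
        # Subtract the row maximum for a numerically stable softmax.
        m = max(penalized)
        exps = [math.exp(p - m) for p in penalized]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With identical logits, the penalty alone decides where attention goes.
uniform_logits = [[0.0] * 4 for _ in range(4)]
weights = distance_penalized_weights(uniform_logits)
```

With uniform logits, position 0 attends to positions 0 to 3 with weights of roughly 0.48, 0.24, 0.16 and 0.12, so nearby frames dominate. Setting `penalty_scale=0.0` recovers ordinary softmax attention.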