al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR, 2021. https://openreview.net/forum?id=YicbFdNTTy

P.15
[Radford+, 2021] Radford et al. Learning Transferable Visual Models From Natural Language Supervision. ICML, 2021.
[Rombach+, 2022] Rombach et al. High-Resolution Image Synthesis with Latent Diffusion Models. CVPR, 2022.

P.16
[He+, 2022] He et al. Masked Autoencoders Are Scalable Vision Learners. CVPR, 2022.

P.28
[Ba+, 2016] Ba et al. Layer Normalization. arXiv, 2016.

P.29
[He+, 2016] He et al. Deep Residual Learning for Image Recognition. CVPR, 2016.

P.36
[Tolstikhin+, 2021] Tolstikhin et al. MLP-Mixer: An all-MLP Architecture for Vision. NeurIPS, 2021.