Baselines for Human Pose Estimation." arXiv preprint arXiv:2204.12484 (2022).
2. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
3. K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, 2022.
4. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), 2014.
5. Cao, Zhe, et al. "OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields." IEEE Transactions on Pattern Analysis and Machine Intelligence 43.1 (2019).
6. Dang, Qi, et al. "Deep learning based 2D human pose estimation: A survey." Tsinghua Science and Technology 24.6 (2019): 663-676.
7. Senior, Andrew W., et al. "Improved protein structure prediction using potentials from deep learning." Nature 577.7792 (2020): 706-710.
8. https://www.slideshare.net/DeepLearningJP2016/dltransformer-vit-perceiver-frozen-pretrained-transformer-etc
9. Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
10. Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2015.