Viola: Model Structure

- There is also an implementation in Transformers [2], which we could use as a reference
- The simple architecture makes it robust to changes of inference backend (see the sketch after the references)
> Example 1: easy to switch to the Flash Attention 2 [3] kernels bundled with PyTorch
> Example 2: easy to convert to ONNX Runtime [4] or TensorRT [5]

[1] J. Wang et al., “GIT: A Generative Image-to-text Transformer for Vision and Language,” Transactions on Machine Learning Research, 2022.
[2] T. Wolf et al., “HuggingFace’s Transformers: State-of-the-art Natural Language Processing,” arXiv preprint arXiv:1910.03771, 2019.
[3] T. Dao, “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning,” in The Twelfth International Conference on Learning Representations (ICLR), 2024.
[4] ONNX Runtime, https://onnxruntime.ai/
[5] NVIDIA TensorRT, https://developer.nvidia.com/tensorrt
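A minimal sketch of both examples, using a hypothetical single-head layer (`SimpleAttention`) as a stand-in rather than Viola's actual code: PyTorch's standard `scaled_dot_product_attention` op can dispatch to the FlashAttention-2 kernel bundled with PyTorch, and because the module uses only standard ops it also exports cleanly to ONNX for ONNX Runtime (TensorRT can consume the same ONNX file). In Transformers itself, the equivalent switch is typically just passing `attn_implementation="sdpa"` to `from_pretrained`, assuming the model class supports it.

```python
# Sketch only: "SimpleAttention" is a toy stand-in for one attention
# layer of a GIT-style model, not the actual Viola implementation.
import numpy as np
import onnxruntime as ort
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleAttention(nn.Module):
    """Single-head self-attention built on the standard SDPA op."""

    def __init__(self, dim: int) -> None:
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Example 1: scaled_dot_product_attention dispatches to the
        # FlashAttention-2 kernel bundled with PyTorch whenever the
        # inputs allow it (CUDA, fp16/bf16, ...) -- no model change.
        y = F.scaled_dot_product_attention(q, k, v)
        return self.proj(y)


model = SimpleAttention(dim=64).eval()
dummy = torch.randn(1, 16, 64)

# Example 2: only standard ops are involved, so the module exports to
# ONNX directly; the resulting file runs on ONNX Runtime and can also
# be handed to TensorRT.
torch.onnx.export(
    model, (dummy,), "attn.onnx",
    input_names=["x"], output_names=["y"],
    dynamic_axes={"x": {0: "batch", 1: "seq"}},
    opset_version=17,
)

sess = ort.InferenceSession("attn.onnx", providers=["CPUExecutionProvider"])
(out,) = sess.run(None, {"x": dummy.numpy()})
np.testing.assert_allclose(out, model(dummy).detach().numpy(), atol=1e-4)
```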