[Figure: two architecture diagrams — an Image Encoder feeding an Adapter that injects special tokens into the Transformer's language token sequence, versus image features entering the Transformer directly.]

Two common ways to connect an image encoder to an LLM:

- Projector-based: an adapter projects image features into the language token sequence as special tokens — e.g. GIT [Wang+], LLaVA [Liu+].
- Cross-attention-based: the language model attends to image features via cross-attention — e.g. BLIP-2 [Li+], Flamingo [Alayrac+].

References:
- Alayrac, Jean-Baptiste, et al. "Flamingo: a visual language model for few-shot learning." NeurIPS 2022.
- Li, Junnan, et al. "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models." ICML 2023.
- Wang, Jianfeng, et al. "GIT: A generative image-to-text Transformer for vision and language." arXiv preprint arXiv:2205.14100 (2022).
- Liu, Haotian, et al. "Visual instruction tuning." NeurIPS 2024.
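The two wiring styles above can be sketched in a few lines of NumPy. This is a toy illustration only: the dimensions, random weight matrices, and single-head attention are assumptions for clarity, not the published GIT/LLaVA or BLIP-2/Flamingo architectures.

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_llm = 64, 128          # image-feature dim, LLM embedding dim (made up)
n_patches, n_text = 16, 8       # image patches, text tokens (made up)

img_feats = rng.standard_normal((n_patches, d_img))   # output of the image encoder
text_embeds = rng.standard_normal((n_text, d_llm))    # language token embeddings

# Projector style (GIT / LLaVA-like): linearly map image features into the
# LLM embedding space and prepend them to the token sequence.
W_proj = rng.standard_normal((d_img, d_llm)) / np.sqrt(d_img)
img_tokens = img_feats @ W_proj
llm_input = np.concatenate([img_tokens, text_embeds], axis=0)
print(llm_input.shape)  # (24, 128): 16 image tokens + 8 text tokens

# Cross-attention style (BLIP-2 / Flamingo-like): text tokens *query* the
# image features instead of receiving them as extra sequence positions.
def cross_attention(q_in, kv_in, d_k=32):
    Wq = rng.standard_normal((q_in.shape[1], d_k)) / np.sqrt(q_in.shape[1])
    Wk = rng.standard_normal((kv_in.shape[1], d_k)) / np.sqrt(kv_in.shape[1])
    Wv = rng.standard_normal((kv_in.shape[1], q_in.shape[1])) / np.sqrt(kv_in.shape[1])
    scores = (q_in @ Wq) @ (kv_in @ Wk).T / np.sqrt(d_k)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ (kv_in @ Wv)           # projected back to the text-stream dim

fused = text_embeds + cross_attention(text_embeds, img_feats)
print(fused.shape)  # (8, 128): sequence length is unchanged
```

Note the practical difference: the projector style lengthens the LLM's input sequence (one extra token per image patch or query), while the cross-attention style keeps the sequence length fixed and fuses visual information inside the attention layers.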