and Editing with Visual Foundation Models. arXiv, 2023.
[You+,2023] You, H. et al. IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models. arXiv, 2023.
[Oord+,2017] van den Oord, A. et al. Neural Discrete Representation Learning. NIPS, 2017.
[Ramesh+,2021] Ramesh, A. et al. Zero-Shot Text-to-Image Generation. arXiv, 2021.
[Mizrahi+,2023] Mizrahi, D. et al. 4M: Massively Multimodal Masked Modeling. NeurIPS, 2023.
[Ramesh+,2022] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv, 2022.
[Cha+,2023] Cha, J., Kang, W., Mun, J. & Roh, B. Honeybee: Locality-enhanced Projector for Multimodal LLM. arXiv, 2023.
[Li+,2023a] Li, J., Li, D., Savarese, S. & Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv, 2023.
[Dai+,2023] Dai, W. et al. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv, 2023.
[Li+,2023b] Li, K. et al. VideoChat: Chat-Centric Video Understanding. arXiv, 2023.
[Zhu+,2023] Zhu, D., Chen, J., Shen, X., Li, X. & Elhoseiny, M. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv, 2023.
[Liu+,2023a] Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual Instruction Tuning. arXiv, 2023.
[Liu+,2023b] Liu, H., Li, C., Li, Y. & Lee, Y. J. Improved Baselines with Visual Instruction Tuning. arXiv, 2023.
[Zhang+,2023] Zhang, H. et al. LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models. arXiv, 2023.
[Radford+,2021] Radford, A. et al. Learning Transferable Visual Models from Natural Language Supervision. ICML, Vol. 139, pp. 8748–8763, 2021.
[Maini+,2023] Maini, P., Goyal, S., Lipton, Z. C., Kolter, J. Z. & Raghunathan, A. T-MARS: Improving Visual Representations by Circumventing Text Feature Learning. arXiv, 2023.
[Shtedritski+,2023] Shtedritski, A., Rupprecht, C. & Vedaldi, A. What does CLIP know about a red circle? Visual prompt engineering for VLMs. ICCV, 2023.