a unified foundation model: Jointly pre-training transformers on unpaired images and text. 2021 He et al., Masked Autoencoders Are Scalable Vision Learners. 2021
zero-shot 強力な学習済みモデルで多くをこなせるが、 できることは固定されている Yuan el al., Florence: A new foundation model for computer vision. 2021 Brown, et al., Language Models are Few-Shot Learners. 2020