training
▪ The model pre-trained on ImageNet-1K is trained with the same training strategy as VideoMAE [74]
◦ self-supervised training
▪ Following the same training recipe as UMT, VideoMamba-M is distilled from CLIP-ViT-B [60] for 800 epochs

dataset                   average video length   train     validation
Kinetics-400              10s                    234619    19761
Something-Something V2    4s                     168913    24777

[74] Tong, Z., Song, Y., Wang, J., Wang, L.: VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In: NeurIPS (2022)
[60] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML (2021)
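The UMT-style distillation step above can be sketched roughly as follows. This is a minimal illustrative sketch (not the actual VideoMamba/UMT implementation): it assumes the student sees only the unmasked tokens and is trained to match the frozen CLIP teacher's features on those tokens via an MSE loss over L2-normalized features; the function name and shapes are hypothetical.

```python
import numpy as np

def umt_style_distill_loss(student_feats, teacher_feats):
    """Hypothetical sketch of UMT-style unmasked-token distillation:
    align the student's features with the frozen CLIP-ViT-B teacher's
    features on the visible tokens, using MSE over L2-normalized vectors.
    Both inputs have shape (num_visible_tokens, feature_dim)."""
    s = student_feats / np.linalg.norm(student_feats, axis=-1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=-1, keepdims=True)
    return float(np.mean((s - t) ** 2))

# Toy usage: identical features give zero loss; mismatched features do not.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))
zero_loss = umt_style_distill_loss(feats, feats)      # 0.0
pos_loss = umt_style_distill_loss(feats, -feats)      # > 0
```

In the actual recipe this loss would be minimized for 800 epochs with the teacher kept frozen; only the student (VideoMamba-M) receives gradients.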