[Figure: Training pipeline. Train data: cite images and apply images, each with weak and strong augmentation (Weak Aug / Strong Aug). Image Encoder: Swin Transformer (swin_large_patch4_window12_384) or ConvNext (convnext_base_22k_384). Head layers: BatchNorm, Normalize, FC; feature sizes 1536 and 1024; 512 embedding. Loss: ContrastiveLoss with CrossBatchMemory (1024).]
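The loss named in the figure, a contrastive loss computed against a cross-batch memory, can be illustrated with a small pure-Python sketch. It is not the authors' code and not a library implementation. The class name `CrossBatchMemorySketch`, the margins, the use of Euclidean distance on L2-normalized embeddings, and averaging over all pairs are assumptions made for illustration. It also assumes that the "1024" next to CrossBatchMemory is the memory size, and that embeddings of the same image (for example its weak and strong augmentations) share a label.

```python
import math
from collections import deque


def l2_normalize(vec):
    """Scale a vector to unit length (assumed meaning of 'Normalize')."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class CrossBatchMemorySketch:
    """Hypothetical helper: FIFO store of past (embedding, label) pairs."""

    def __init__(self, memory_size=1024):
        # deque(maxlen=...) silently drops the oldest entries when full.
        self.items = deque(maxlen=memory_size)

    def loss(self, embeddings, labels, pos_margin=0.0, neg_margin=1.0):
        batch = [(l2_normalize(e), y) for e, y in zip(embeddings, labels)]
        terms = []
        for i, (emb, label) in enumerate(batch):
            # Pair each item with everything in memory and with the
            # later items of the current batch (each pair counted once).
            candidates = list(self.items) + batch[i + 1:]
            for other_emb, other_label in candidates:
                dist = euclidean(emb, other_emb)
                if label == other_label:
                    # Positive pair: penalize distance above pos_margin.
                    terms.append(max(dist - pos_margin, 0.0))
                else:
                    # Negative pair: penalize distance below neg_margin.
                    terms.append(max(neg_margin - dist, 0.0))
        # Store the current batch so later batches can pair with it.
        self.items.extend(batch)
        # Simplification: plain mean over all pairs.
        return sum(terms) / len(terms) if terms else 0.0


memory = CrossBatchMemorySketch(memory_size=4)
first = memory.loss([[1, 0], [0, 1]], labels=[0, 1])
second = memory.loss([[1, 0]], labels=[1])
```

In the second call the single new embedding is compared against both stored embeddings from the first call, which is the point of the memory: pairs are formed across batches, so the number of pairs per step is not limited by the batch size.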