
[NeurIPS Japan meetup 2021 talk] Understanding Negative Samples in Instance Discriminative Self-supervised Representation Learning

https://neuripsmeetup.jp/2021/
The slides are almost the same as our NeurIPS talk.

Kento Nozawa

December 13, 2021

Transcript

  1. Understanding Negative Samples in Instance Discriminative Self-supervised Representation Learning
     Kento Nozawa¹·², Issei Sato¹
     Paper: https://openreview.net/forum?id=pZ5X_svdPQ
     Code: https://github.com/nzw0301/Understanding-Negative-Samples
  2. Short summary of this talk
     [Figure: upper bound of the supervised loss (Arora et al. vs. ours; lower is better) and validation accuracy (%) on CIFAR-100 (higher is better), plotted against the number of negative samples + 1]
     • We point out an inconsistency between self-supervised learning's common practice and an existing theoretical analysis.
       • Practice: a large number of negative samples doesn't hurt classification performance.
       • Theory: it hurts classification performance.
     • We propose a novel analysis using the coupon collector's problem.
  3. Instance discriminative self-supervised representation learning
     Goal: learn a generic feature encoder f, for example a deep neural net, for a downstream task such as classification.
     Feature representations help a linear classifier attain classification accuracy comparable to that of a supervised method trained from scratch.
  4. Overview of instance discriminative self-supervised representation learning
     Draw K + 1 samples from an unlabeled dataset:
     • x: the anchor sample.
     • x⁻: a negative sample. It can be a set of K samples {x⁻_k}_{k=1}^{K}.
  5. Overview of instance discriminative self-supervised representation learning
     Apply data augmentation to the samples:
     • For the anchor sample x, we draw and apply two data augmentations, giving a and a⁺.
     • For each negative sample x⁻, we draw and apply a single data augmentation, giving a⁻.
  6. Overview of instance discriminative self-supervised representation learning
     The feature encoder f maps the augmented samples a, a⁺, a⁻ to feature vectors h, h⁺, h⁻.
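The three steps on slides 4–6 (sampling, augmentation, encoding) can be summarized in a minimal sketch. The `augment` and `encode` functions below are hypothetical placeholders, not the augmentations or encoder from the paper; only the data flow x → a → h is meant to match the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x: np.ndarray) -> np.ndarray:
    """Placeholder augmentation (real pipelines use crops, flips, color jitter, ...)."""
    return x + 0.01 * rng.normal(size=x.shape)

def encode(x: np.ndarray) -> np.ndarray:
    """Placeholder feature encoder f (in practice a deep neural network)."""
    return x / (np.linalg.norm(x) + 1e-12)

dataset = rng.normal(size=(1000, 32))  # toy unlabeled dataset
K = 16                                 # number of negative samples

# Draw K + 1 samples: one anchor x and K negatives x^-_1, ..., x^-_K.
idx = rng.choice(len(dataset), size=K + 1, replace=False)
x, x_negs = dataset[idx[0]], dataset[idx[1:]]

# Two augmentations of the anchor, one per negative, then encode everything with f.
h      = encode(augment(x))                                 # a   -> h
h_pos  = encode(augment(x))                                 # a+  -> h+
h_negs = np.stack([encode(augment(xn)) for xn in x_negs])   # a-  -> h-
```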
  7. Overview of instance discriminative self-supervised representation learning
     Contrastive loss function, e.g., InfoNCE [1]:
       −ln ( exp[sim(h, h⁺)] / ( exp[sim(h, h⁺)] + exp[sim(h, h⁻)] ) )
     • Minimize a contrastive loss given the feature representations.
     • sim(·, ·): a similarity function, such as cosine similarity.
     • The learned encoder f̂ works as a feature extractor for a downstream task.
     [1] Oord et al. Representation Learning with Contrastive Predictive Coding, arXiv, 2018.
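As a concrete reference, here is a minimal NumPy sketch of the InfoNCE loss above for one anchor, one positive, and K negatives. The temperature parameter and the sum over K negatives are common practice rather than part of the slide's single-negative formula.

```python
import numpy as np

def cosine_sim(u: np.ndarray, v: np.ndarray) -> float:
    """sim(·, ·): cosine similarity between two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def info_nce(h, h_pos, h_negs, temperature: float = 1.0) -> float:
    """InfoNCE for one anchor:
    −ln exp[sim(h, h⁺)/t] / (exp[sim(h, h⁺)/t] + Σ_k exp[sim(h, h⁻_k)/t])."""
    logits = np.array([cosine_sim(h, h_pos)] + [cosine_sim(h, hn) for hn in h_negs]) / temperature
    # Log-sum-exp over positive + negatives for numerical stability; the positive sits at index 0.
    return float(np.logaddexp.reduce(logits) - logits[0])

# Toy usage with random feature vectors (dimensions are arbitrary for the sketch).
rng = np.random.default_rng(0)
h, h_pos = rng.normal(size=128), rng.normal(size=128)
h_negs = rng.normal(size=(16, 128))  # K = 16 negatives
print(info_nce(h, h_pos, h_negs))
```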
  8. Common technique: use a large number of negative samples K
     In practice, increasing the number of negative samples makes the learned f̂ yield more informative features for the linear classifier.
     For ImageNet:
     • MoCo [2]: K = 65,536.
     • SimCLR [3]: K = 8,190 or even more.
     [Figure: validation accuracy (%) on CIFAR-10 against the number of negative samples + 1]
     [2] He et al. Momentum Contrast for Unsupervised Visual Representation Learning, In CVPR, 2020.
     [3] Chen et al. A Simple Framework for Contrastive Learning of Visual Representations, In ICML, 2020.
  9. A theory of contrastive representation learning
     Informal bound [4], modified for self-supervised learning:
       L_cont(f) ≥ (1 − τ_K)(L_sup(f) + L_sub(f)) + τ_K ln(Col + 1) + d(f), where τ_K ln(Col + 1) is the collision term.
     • τ_K: the collision probability that the anchor's label appears among the negatives' labels.
     [4] Arora et al. A Theoretical Analysis of Contrastive Unsupervised Representation Learning, In ICML, 2019.
  10. A theory of contrastive representation learning
      Informal bound [4], modified for self-supervised learning:
        L_cont(f) ≥ (1 − τ_K)(L_sup(f) + L_sub(f)) + τ_K ln(Col + 1) + d(f), where τ_K ln(Col + 1) is the collision term.
      • τ_K: the collision probability that the anchor's label appears among the negatives' labels.
      • L_sup(f): the supervised loss with f.
      • L_sub(f): the supervised loss over a subset of labels with f.
      • Col: the number of negative labels that duplicate the anchor's label.
      • d(f): a function of f, but an almost constant term in practice.
      [4] Arora et al. A Theoretical Analysis of Contrastive Unsupervised Representation Learning, In ICML, 2019.
  11. The bound of L_sup explodes with large K
      • On CIFAR-10, where the number of classes is 10, with K = 31:
        about 96% of the drawn samples contribute only to the collision term, which is unrelated to the supervised loss, because of τ_K.
      • The plots show the rearranged upper bound:
        L_sup(f) ≤ (1 − τ_K)⁻¹ [ L_cont(f) − τ_K ln(Col + 1) − d(f) ] − L_sub(f)
      [Figure: the rearranged upper bound of the supervised loss (Arora et al.) and validation accuracy (%) on CIFAR-10 and CIFAR-100, plotted against the number of negative samples + 1]
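To see where the 96% figure comes from, here is a minimal sketch of the collision probability under the simplifying assumption that labels are drawn uniformly (an assumption of this sketch, matching the class-balanced CIFAR-10 setting):

```python
C = 10   # number of classes (CIFAR-10)
K = 31   # number of negative samples

# Probability that at least one of K independently drawn negative labels
# collides with the anchor's label, assuming a uniform label distribution.
tau_K = 1.0 - ((C - 1) / C) ** K
print(f"tau_K = {tau_K:.3f}")  # ≈ 0.962, i.e. the "about 96%" on the slide
```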
  12. Contributions: a novel lower bound of the contrastive loss
      Informal proposed bound:
        L_cont(f) ≥ (1/2) { υ_{K+1} L_sup(f) + (1 − υ_{K+1}) L_sub(f) + ln(Col + 1) } + d(f)
      • Key idea: replace the collision probability τ with υ_{K+1}, the probability from the coupon collector's problem that the K + 1 samples' labels include all the supervised labels.
  13. Contributions: a novel lower bound of the contrastive loss
      Informal proposed bound:
        L_cont(f) ≥ (1/2) { υ_{K+1} L_sup(f) + (1 − υ_{K+1}) L_sub(f) + ln(Col + 1) } + d(f)
      • Key idea: replace the collision probability τ with υ_{K+1}, the probability from the coupon collector's problem that the K + 1 samples' labels include all the supervised labels.
      • Additional insight: the expected K + 1 needed to draw all supervised labels from ImageNet-1K is about 7,700.
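Both quantities above can be sketched numerically. The snippet below assumes a uniform label distribution, which is a simplification: under it the expected number of draws is roughly 1000·H_1000 ≈ 7,485, somewhat below the ≈7,700 reported for ImageNet-1K (whose empirical label distribution is not exactly uniform).

```python
from math import comb

def coverage_probability(num_draws: int, num_classes: int) -> float:
    """υ: probability that `num_draws` uniform draws contain every one of
    `num_classes` labels (coupon-collector coverage, via inclusion-exclusion)."""
    C = num_classes
    return sum((-1) ** j * comb(C, j) * ((C - j) / C) ** num_draws
               for j in range(C + 1))

def expected_draws_to_cover(num_classes: int) -> float:
    """Expected number of uniform draws until every label has appeared:
    C * H_C, the classic coupon collector's expectation."""
    return num_classes * sum(1.0 / i for i in range(1, num_classes + 1))

print(expected_draws_to_cover(1000))      # ≈ 7485 draws under the uniform assumption
print(coverage_probability(8191, 1000))   # υ_{K+1} for SimCLR-scale K + 1 = 8191 (≈ 0.76 here)
```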
  14. Our bound doesn’t explode 14 32 64 128 256 512

    # negative samples +1 102 105 108 1011 1014 1017 Upper bound of supervised loss Arora et al. Ours Validation accuracy 77 78 79 80 81 82 83 84 Validation accuracy (%) on CIFAR-10 128 256 384 512 640 768 896 1024 # negative samples +1 101 102 103 104 105 Upper bound of supervised loss Arora et al. Ours Validation accuracy 42 43 44 45 46 47 Validation accuracy (%) on CIFAR-100
  15. Justification of clustering-based algorithms
      • Clustering-based methods, such as SwAV [4], minimize an InfoNCE loss [1] with prototype representations c⁺, c⁻ instead of the positive / negative features h⁺, h⁻:
          −ln ( exp[sim(h, c⁺)] / ( exp[sim(h, c⁺)] + exp[sim(h, c⁻)] ) )
      • SwAV performs well with a smaller mini-batch size than SimCLR.
      [Diagram: anchor x → a → f → h, compared against prototypes c⁺, c⁻ obtained by clustering. Note that this diagram is greatly simplified for my presentation.]
      [4] Caron et al. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments, In NeurIPS, 2020.
  16. Justification of clustering-based algorithms
      • Clustering-based methods, such as SwAV [4], minimize an InfoNCE loss with prototype representations instead of negative features.
      • SwAV performs well with a smaller mini-batch size than SimCLR.
      • Comparison:
        • Instance discriminative methods, e.g., SimCLR: need a large number of negative samples to cover the supervised labels; the negative samples come from the mini-batch.
        • Clustering-based methods: a large mini-batch size is unnecessary thanks to the prototype representations.
      [4] Caron et al. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments, In NeurIPS, 2020.
  17. Conclusion
      • We pointed out an inconsistency between self-supervised learning's common practice and the existing bound.
        • Practice: a large K doesn't hurt classification performance.
        • Theory: a large K hurts classification performance.
      • We proposed a new bound using the coupon collector's problem.
      • Additional results:
        • An upper bound of the collision term.
        • Optimality when υ = 0 due to too small a K.
        • Experiments on an NLP dataset.