[NeurIPS Japan meetup 2021 talk] Understanding Negative Samples in Instance Discriminative Self-supervised Representation Learning

Kento Nozawa1,2 Understanding Negative Samples in Instance Discriminative Self-supervised representation
Learning Paper: https://openreview.net/forum?id=pZ5X_svdPQ Code: https://github.com/nzw0301/Understanding-Negative-Samples Issei Sato1 1 2

Short summary of this talk 2 128 256 384 512
640 768 896 1024 # negative samples +1 101 102 103 104 105 Upper bound of supervised loss Arora et al. Ours Validation accuracy 42 43 44 45 46 47 Validation accuracy (%) on CIFAR-100 • We point out the inconsistency between self-supervised learning’s common practice and an existing theoretical analysis. • Practice: Large # negative samples don’t hurt classi fi cation performance. • Theory: they hurt classi fi cation performance. • We propose an novel analysis using Coupon   collector’s problem. Accuracy: higher is better Bound: lower is better

Instance discriminative self-supervised representation learning 3 Goal: Learn generic feature
encoder , for example deep neural nets, for a downstream task, such as classi fi cation. Feature representations help a linear classi fi er to attain classi fi cation accuracy comparable to a supervised method from scratch. f

Overview of Instance discriminative self-supervised representation learning 4 Anchor x
Negative x− Draw samples from an unlabeled dataset. • : anchor sample. • : negative sample. It can be a set of samples . K + 1 x x− {x− k }K k=1

Negative x− a a+ a− Apply data augmentation to the samples: • For the anchor sample , we draw and apply two data augmentations . • For negative sample , we draw and apply single data augmentation . x a, a+ x− a−

Negative x− a a+ a− f h h− h+ Feature encoder maps augmented samples to feature vectors . f h, h+, h−

• Minimize a contrastive loss given feature representations. • :
a similarity function, such as cosine similarity. • Learned works as a feature extractor for a downstream task. sim( . ⋅ . ) ̂ f Overview of Instance discriminative self-supervised representation learning 7 Anchor x Negative x− a a+ a− f h h− h+ Contrastive loss function, ex. InfoNCE [1]: −ln exp[sim(h, h+)] exp[sim(h, h+)] + exp[sim(h, h−)] } [1] Oord et al. Representation Learning with Contrastive Predictive Coding, arXiv, 2018.

Common technique: use large # negative samples K 8 By
increasing # negative samples, learned yields informative features for linear classi fi er in practice. For ImageNet, wMoCo [2]: . wSimCLR [3]: or even more. ̂ f K = 65 536 K = 8 190 32 64 128 256 512 # negative samples +1 78 80 82 84 Validation accuracy (%) on CIFAR-10 [2] He et al. Momentum Contrast for Unsupervised Visual Representation Learning, In CVPR, 2020. [3] Chen et al. A Simple Framework for Contrastive Learning of Visual Representations, In ICML, 2020.

A theory of contrastive representation learning 9 Informal bound [4]
modi fi ed for self-supervised learning: . • : Collision probability that anchor’s label appears in negatives’ one. Lcont (f) ≥ (1 − τK )(Lsup (f) + Lsub (f)) + τK ln(Col + 1) Collison term + d(f) τK [4] Arora et al. A Theoretical Analysis of Contrastive Unsupervised Representation Learning, In ICML, 2019.

A theory of contrastive representation learning 10 Informal bound [4]
modi fi ed for self-supervised learning: . • : Collision probability that anchor’s label appears in negatives’ one. • : Supervised loss with . • : Supervised loss over subset of labels with . • : the number of duplicated negative labels with the anchor’s label. • : a function of , but almost constant term in practice. Lcont (f) ≥ (1 − τK )(Lsup (f) + Lsub (f)) + τK ln(Col + 1) Collison term + d(f) τK Lsup (f) f Lsub (f) f Col d(f) f [4] Arora et al. A Theoretical Analysis of Contrastive Unsupervised Representation Learning, In ICML, 2019.

The bound of explodes with large Lsup K 11 •
The bound on CIFAR-10, where # classes is with : • About samples contribute the collision term not related to the supervised loss due to . • Plots rearranged upper bound: : 10 K = 31 96 % τK Lsup (f) ≤ (1 − τK )−1[Lcont (f) − τK ln(Col + 1) − d(f)] − Lsub (f) 32 64 128 256 512 # negative samples +1 103 106 109 1012 1015 Upper bound of supervised loss Arora et al. Validation accuracy 77 78 79 80 81 82 83 84 Validation accuracy (%) on CIFAR-10 128 256 384 512 640 768 896 1024 # negative samples +1 101 102 103 104 105 Upper bound of supervised loss Arora et al. Validation accuracy 42 43 44 45 46 47 Validation accuracy (%) on CIFAR-100

Contributions: novel lower bound of contrastive loss 12 Informal proposed
bound: . • Key idea: replace collision probability with Coupon collector’s problem’s probability that samples’ labels include the all supervised labels. Lcont (f) ≥ 1 2 {υK+1 Lsup (f) + (1 − υK+1 )Lsub (f) + ln(Col + 1)} + d(f) τ υK+1 K + 1

Contributions: novel lower bound of contrastive loss 13 Informal proposed
bound: . • Key idea: replace collision probability with Coupon collector’s problem’s probability that samples’ labels include the all supervised labels. • Additional insight: the expected to draw all supervised labels   from ImageNet-1K is about . Lcont (f) ≥ 1 2 {υK+1 Lsup (f) + (1 − υK+1 )Lsub (f) + ln(Col + 1)} + d(f) τ υK+1 K + 1 K + 1 7700

Our bound doesn’t explode 14 32 64 128 256 512
# negative samples +1 102 105 108 1011 1014 1017 Upper bound of supervised loss Arora et al. Ours Validation accuracy 77 78 79 80 81 82 83 84 Validation accuracy (%) on CIFAR-10 128 256 384 512 640 768 896 1024 # negative samples +1 101 102 103 104 105 Upper bound of supervised loss Arora et al. Ours Validation accuracy 42 43 44 45 46 47 Validation accuracy (%) on CIFAR-100

Justi fi cation of clustering-based algorithms 15 • Clustering-based method,
such as SwAV [4], minimizes InfoNCE loss with prototype representations instead of positive / negative features . • SwAV performs well with a smaller mini-batch size than SimCLR. c+, c− h+, h− Note that this diagram is very simpli fi ed for my presentation. [4] Caron et al. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments, In NeurIPS, 2020. a f h c− c+ Contrastive loss function, ex. InfoNCE [1]: −ln exp[sim(h, c+)] exp[sim(h, c+)] + exp[sim(h, c−)] } Anchor x Clustering

Justi fi cation of clustering-based algorithms 16 • Clustering-based method,
such as SwAV [4], minimizes InfoNCE loss with prototype representations instead of negative features. • SwAV performs well with a smaller mini-batch size than SimCLR. • Comparison: • Instance discriminative method, e.g., SimCLR: needs large negative samples to cover supervised labels. Negative samples come from a mini-batch. • Clustering-based method: unnecessary to use large mini-batch size thanks to prototype representations. [4] Caron et al. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments, In NeurIPS, 2020.

Conclusion 17 • We pointed out the inconsistency between self-supervised
learning’s common practice and the existing bound. • Practice: Large doesn’t hurt classi fi cation performance. • Theory: large hurts classi fi cation performance. • We proposed the new bound using Coupon collector’s problem. • Additional results: • Upper bound of the collision term. • Optimality when with too small . • Experiments on a NLP dataset. K K υ = 0 K

[NeurIPS Japan meetup 2021 talk] Understanding ...

[NeurIPS Japan meetup 2021 talk] Understanding Negative Samples in Instance Discriminative Self-supervised Representation Learning

Kento Nozawa

More Decks by Kento Nozawa

Other Decks in Research

Featured

Transcript

Kento Nozawa1,2 Understanding Negative Samples in Instance Discriminative Self-supervised representation

Short summary of this talk 2 128 256 384 512

Instance discriminative self-supervised representation learning 3 Goal: Learn generic feature

Overview of Instance discriminative self-supervised representation learning 4 Anchor x

Overview of Instance discriminative self-supervised representation learning 5 Anchor x

Overview of Instance discriminative self-supervised representation learning 6 Anchor x

• Minimize a contrastive loss given feature representations. • :

Common technique: use large # negative samples K 8 By

A theory of contrastive representation learning 9 Informal bound [4]

A theory of contrastive representation learning 10 Informal bound [4]

The bound of explodes with large Lsup K 11 •

Contributions: novel lower bound of contrastive loss 12 Informal proposed

Contributions: novel lower bound of contrastive loss 13 Informal proposed

Our bound doesn’t explode 14 32 64 128 256 512

Justi fi cation of clustering-based algorithms 15 • Clustering-based method,

Justi fi cation of clustering-based algorithms 16 • Clustering-based method,

Conclusion 17 • We pointed out the inconsistency between self-supervised