
Soft-Weight Networks for Few-Shot Learning

Gregory Ditzler

June 04, 2019

Transcript

  1. Soft Weight Networks for Few-Shot Learning
     Samuel Hess and Gregory Ditzler
     Department of Electrical & Computer Engineering, The University of Arizona, Tucson, AZ 85721
     {shess, ditzler}@email.arizona.edu
     04 June 2019
  2. Overview: Plan of Attack for the Next 40 Minutes
     1 Overview of Learning from Data
     2 What is Few-Shot Learning?
     3 Soft Weight Networks
     4 Preliminary Results
     5 Applications & Conclusion
     Samuel Hess and Gregory Ditzler, "Soft Weight Networks for Few-Shot Learning," submitted to Neural Networks, 2019.
  3. Overview of Research
     Feature Selection: Neyman-Pearson, MABS+FS, Parallelism, Adversarial FS, Misc.
     Model Optimization: Compressive Sensing
     Applications: Environmental, Cyber, Human Health
     Learning: Few-Shot, Ensembles, Concept Drift, Partial Information
  4. Overview of Learning from Data: Supervised Machine Learning in a Nutshell
     Data -> Machine Learning -> Model: train on D := {(x_i, y_i)}_{i=1}^n to produce a predictor ŷ(x), then deploy the model on D_test := {(x_i, y_i)}_{i=1}^n to make predictions.
     Free parameters: θ
     Different losses to minimize. [Plot: loss L(h, f) versus the margin h·f for the log, 0-1, hinge, and modified Huber losses.]
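     As a concrete reference for the plotted losses, the sketch below evaluates each loss as a function of the margin m = h·f. This is an illustrative Python sketch using the usual textbook definitions of these losses, not code from the talk.

```python
import numpy as np

def zero_one_loss(m):
    # 1 when the prediction's sign disagrees with the label, else 0
    return (m <= 0).astype(float)

def hinge_loss(m):
    # max(0, 1 - m): penalizes margins below 1
    return np.maximum(0.0, 1.0 - m)

def log_loss(m):
    # logistic loss, a smooth surrogate for the 0-1 loss
    return np.log(1.0 + np.exp(-m))

def modified_huber_loss(m):
    # quadratic near the decision boundary, linear for badly misclassified points
    return np.where(m >= -1, np.maximum(0.0, 1.0 - m) ** 2, -4.0 * m)

margins = np.linspace(-3, 3, 7)
for name, fn in [("0-1", zero_one_loss), ("hinge", hinge_loss),
                 ("log", log_loss), ("modified Huber", modified_huber_loss)]:
    print(name, np.round(fn(margins), 2))
```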
  5. Overview of Learning from Data: Supervised(ish) Learning from Data
     First day on the job for an ML engineer: let us consider the situation where we want to train a convolutional neural network (CNN) on a database of images that are all labeled. Our goal (in general) is to minimize some cost function with backpropagation such that we have a small generalization loss in the future.
     What is the end goal for industry and DoD? Generalization is typically one of the most important properties an algorithm can have once it is deployed in a real-time system where classification is performed. There is more uncertainty when these predictions are being made because the environment is generally not as controlled as it was when the CNN was trained.
  6. What is an Artificial Neural Network?
     A single node computes a weighted sum of its inputs, adds a bias, and applies a nonlinearity:
     z_j = σ( Σ_{i=1}^n w_i x_i + b ) = σ( w_j^T x + b )
     Definition: a neural network is a highly interconnected network of information-processing elements that mimics the connectivity and functioning of the human brain.
     One of the most significant strengths of neural networks is their ability to learn from a limited set of examples.
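     A minimal Python sketch of the single node above, assuming a sigmoid for the generic activation σ; the names and numbers are illustrative only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w, b):
    # z_j = sigma(w^T x + b): linear combination of the inputs followed by a nonlinearity
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0, 0.1])   # inputs x_1..x_n
w = np.array([0.2, 0.4, -0.1, 0.7])   # weights w_1..w_n
b = 0.05                              # bias
print(neuron(x, w, b))
```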
  7. Artificial Neural Networks: The Multi-layer Perceptron
     What is a Neural Network? Data are passed through the network to a hidden layer whose nodes are linear combinations of the inputs with a nonlinear function applied:
     z = σ(W_MD x + b_MD);  y = ξ(W_KM z + b_KM)
     Given a dataset, our goal is to find the weight parameters that minimize a cost function (e.g., MSE or cross-entropy); these parameters are optimized with gradient descent,
     w_{τ+1} = w_τ − η∇C(w_τ)  or  w_{τ+1} = w_τ − η∇C_n(w_τ)
     where η > 0 is known as the learning rate.
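     The sketch below ties the two pieces together: a forward pass through one hidden layer and a single gradient-descent step on a mean-squared-error cost. It is an illustrative sketch, not the authors' code; the output nonlinearity ξ is taken as the identity and only the output-layer gradient is shown.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W_MD, b_MD, W_KM, b_KM):
    # z = sigma(W_MD x + b_MD); y = xi(W_KM z + b_KM), with xi taken as the identity here
    z = sigmoid(W_MD @ x + b_MD)
    y = W_KM @ z + b_KM
    return z, y

def gd_step(w, grad, eta=0.1):
    # w_{tau+1} = w_tau - eta * grad C(w_tau)
    return w - eta * grad

rng = np.random.default_rng(0)
x, t = rng.normal(size=4), rng.normal(size=2)       # one input/target pair
W_MD, b_MD = rng.normal(size=(3, 4)), np.zeros(3)   # input -> hidden
W_KM, b_KM = rng.normal(size=(2, 3)), np.zeros(2)   # hidden -> output

z, y = forward(x, W_MD, b_MD, W_KM, b_KM)
grad_W_KM = np.outer(y - t, z)                      # dC/dW_KM for C = 0.5 * ||y - t||^2
W_KM = gd_step(W_KM, grad_W_KM)
```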
  8. Nonlinear Activation Functions
     Why do we need a nonlinear function?
     The nonlinear activation functions allow the neural network to solve complex nonlinear problems.
     The universal approximation theorem states that a neural network with a single hidden layer containing a finite number of neurons can approximate continuous functions under mild assumptions on the activation function.
     Note that we do not know the functions that are needed, and we also do not know the architecture of the neural network.
     [Figure: examples of nonlinear activation functions.]
  9. Gradient and Stochastic Gradient Descent
     How do we train a neural network? Neural networks are trained using gradient and stochastic gradient descent, which means the backpropagation algorithm takes small steps to minimize the cost function. The optimization of the neural network weights is a nonconvex problem that will lead to a local minimum.
     Two views of optimization: batch and online (batch gradient descent vs. stochastic gradient descent).
     Tricks for training:
     Momentum and accelerated gradient methods (e.g., Nesterov) can help us get to a better local minimum.
     Use validation data to avoid overtraining (i.e., memorizing).
     Adaptive step sizes can help avoid settling into worse local minima.
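     To make the batch-versus-stochastic distinction concrete, here is a small sketch for a linear model with a mean-squared-error cost: batch gradient descent takes one step per pass over the full data set, while stochastic (mini-batch) descent takes many noisy steps per pass. The model, learning rate, and data are illustrative assumptions, not from the slides.

```python
import numpy as np

def grad_mse(w, X, y):
    # gradient of the mean squared error of a linear model y_hat = X w
    return 2.0 * X.T @ (X @ w - y) / len(y)

def batch_gd(w, X, y, eta=0.1, epochs=100):
    # one step per epoch using the entire data set
    for _ in range(epochs):
        w = w - eta * grad_mse(w, X, y)
    return w

def sgd(w, X, y, eta=0.1, epochs=100, batch=8, seed=0):
    # many noisy steps per epoch, each using a small random mini-batch
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), batch):
            sel = idx[start:start + batch]
            w = w - eta * grad_mse(w, X[sel], y[sel])
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=64)
w0 = np.zeros(3)
print(batch_gd(w0, X, y), sgd(w0, X, y))
```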
 10. Other Architectures: Convolutional Neural Network (CNN)
 11. Few-Shot Learning
     We now have a general understanding of how a neural network is trained in a supervised learning setting. Now let us understand the task of few-shot learning.
 12. Motivating Few-Shot Learning with an Example
     Data available at training time for a neural net.
 13. Motivating Few-Shot Learning with an Example: What should the CNN tell me this is?
 14. Motivating Few-Shot Learning with an Example: What should the CNN tell me this is?
 15. Motivating Few-Shot Learning with an Example: What should the CNN tell me this is?
 16. Motivating Few-Shot Learning with an Example: What should the CNN tell me this is?
 17. Motivating Few-Shot Learning with an Example: What should the CNN tell me this is?
 18. Motivating Few-Shot Learning
     Too much data = problems! Too little data = even more problems!
     What if we were only given a few samples of a snake? Deep neural networks easily over-fit the training data, especially when only a small amount of training data is available. Is there a better way to train a neural network to classify from so few training examples? Even some of the proposed approaches still require more than a "few" examples to achieve arbitrarily high accuracy.
     What is few-shot classification? Few-shot classification is a task in which a classifier must be adapted to accommodate new classes not seen in training, given only a few examples of each of these classes. Humans can perform even one-shot classification, but it is much more difficult for a neural network to achieve this task.
 19. Few-Shot Objective
     Given only one (or a "few") labeled examples of novel classes during testing, correctly identify other unlabeled examples with arbitrarily high accuracy.
     20-Way 1-Shot Example: given a support set of 20 classes with one labeled sample each, label a new query sample. Answer: this sample belongs to class 6 in the labeled support set.
     N-way = N classes; k-shot = k labeled sample(s) per class. Performance of an algorithm should decrease with N and increase with k.
 20. Benchmark Data Sets
     There are two commonly used benchmark data sets established for few-shot algorithm evaluation:
     Omniglot: 50 different language alphabets (1,623 total characters/classes) with 20 human-drawn samples for each character (32,460 total samples).
     MiniImageNet: 100 classes of color images with 600 samples per class (60,000 total samples); a subset of the commonly used ILSVRC 2012, which has 1,200 samples per class and 1,000 classes.
     Partition the data into disjoint sets of classes: training (65%), validation (10%), and testing (25%).
     Side Note: although the number of samples per class is small, the total number of samples is sufficient to train a neural network. Oriol Vinyals et al. found a way to exploit this in Matching Networks (2016).
 21. Matching Networks
     Episodic training: repeatedly and randomly select a set of classes from the training data, partition those classes' samples into a support (labeled) group and a query (unlabeled) group, and train the network to select the correct support class for each of the query samples.
     Instead of learning to identify a specific class, the matching network learns to match a given sample to another, regardless of class.
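     A minimal sketch of how one training episode might be assembled, assuming the training data are a list of (sample, label) pairs; the function name and default parameters are hypothetical, not from the paper.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, n_query=5, seed=None):
    """Sample one episode: pick n_way classes at random, then split each chosen
    class's samples into a labeled support set and an unlabeled query set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for c in classes:
        samples = rng.sample(by_class[c], k_shot + n_query)
        support += [(x, c) for x in samples[:k_shot]]   # labeled support samples
        query += [(x, c) for x in samples[k_shot:]]     # held-out query samples
    return support, query
```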
 22. Prototypical Networks
     Prototypical Networks simplified the Matching Network by embedding each of the support samples one at a time instead of all at once. Via SGD, Prototypical Networks minimize the Euclidean distance between the query embedding and the average embedding (prototype) of each class's support samples.
     Comparison with Matching Networks:
     Still uses episodic training.
     Not all support samples need to be included together.
     Reduces the number of learned parameters.
     A trained 5-way, 1-shot network does not have to be retrained for an arbitrary N-way, k-shot network.
     Uses the Euclidean distance cost function instead of cosine distance.
     Demonstrated better performance.
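     The following sketch shows the prototype step in isolation: class prototypes are the means of the embedded support samples, and a query is assigned to the nearest prototype in Euclidean distance. The embedding network is assumed to have been applied already; this is an illustration, not the authors' implementation.

```python
import numpy as np

def prototypes(support_embeddings, support_labels):
    # one prototype per class: the mean of that class's embedded support samples
    emb = np.asarray(support_embeddings)
    labels = np.asarray(support_labels)
    classes = sorted(set(support_labels))
    protos = np.stack([emb[labels == c].mean(axis=0) for c in classes])
    return protos, classes

def classify(query_embedding, protos, classes):
    # assign the query to the class whose prototype is nearest in squared Euclidean distance
    d2 = ((protos - np.asarray(query_embedding)) ** 2).sum(axis=1)
    return classes[int(np.argmin(d2))]
```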
 23. Prototypical Networks
 24. Soft Weight Networks
     Soft Weight Networks (SWN) modify Prototypical Networks by including a soft weight layer that cross-compares support sample features to query sample weights and support sample weights to query sample features.
     Like Prototypical Networks, the SWN still uses episodic training.
     The network architecture is a novel design for a two-way cross comparison between support and query.
     The similarity metric is modified to utilize the new architecture: Prototypical Networks use the minimum Euclidean distance between embedded features, whereas SWNs use the maximum sum of embedded features weighted by embedded weights.
     SWNs use the episodes to learn the similarity metric for each support-query pair.
     Empirical results have shown that both the network architecture and the similarity function improve performance on few-shot benchmark tests.
     Access to the feature weights allows potential for feature selection and online learning.
 25. Soft Weight Networks
     [Architecture diagram: the same network (feature embedding layers followed by a soft weight layer) is run on a support sample x_s and a query sample x_q, producing features f(x_s), f(x_q), soft weight features w(f(x_s)), w(f(x_q)), and biases b(f(x_s)), b(f(x_q)). The score sums the two cross terms f^T(x_q) w(f(x_s)) + b(f(x_s)) and f^T(x_s) w(f(x_q)) + b(f(x_q)).]
 26. Soft Weight Networks Training Algorithm
     Input: entire training set Tr = {(x_1, y_1), ..., (x_{N_Tr}, y_{N_Tr})}, number of support samples per episode N, number of query samples per episode M, number of classes per episode C, initial model φ, and model learning rate η
     Output: updated model φ, network loss J(φ), and class estimates ŷ_qm
     1:  for each episode in the training data do
     2:    Create a set E of C different classes randomly chosen from Tr without replacement
     3:    Create disjoint sets S = {(x_1, y_1), ..., (x_N, y_N)} and Q = {(x_1, y_1), ..., (x_M, y_M)} of N support samples and M query samples from E
     4:    for m = 1, ..., M do   {loop through query samples}
     5:      for c = 1, ..., C do   {loop through classes}
     6:        // Compute the classification scores for every query sample
     7:        α_qm(c) = Σ_{(x_si, y_si) ∈ S_c} [ b_φ(f_φ(x_si)) + f_φ(x_qm)^T w_φ(f_φ(x_si)) ]
     8:        β_qm(c) = Σ_{(x_si, y_si) ∈ S_c} [ b_φ(f_φ(x_qm)) + f_φ(x_si)^T w_φ(f_φ(x_qm)) ]
     9:      end for
     10:     for c = 1, ..., C do
     11:       p_φ(y = c | x_qm) = exp(α_qm(c) + β_qm(c)) / Σ_{c' ∈ C} exp(α_qm(c') + β_qm(c'))   {estimate posterior via softmax}
     12:     end for
     13:     ŷ_qm = argmax_{c ∈ C} p_φ(y = c | x_qm)   {estimate class of query sample}
     14:   end for
     15:   J(φ) = −(1/(CM)) Σ_{c'=1}^C Σ_{q=1}^M log p_φ(y = c' | x_qm)   {compute average loss of the episode}
     16:   φ ← φ − η∇_φ J(φ)   {update model}
     17: end for
 27. Setting up the Soft Weight Networks
     The features of each of the query samples are weighted by the soft weights of the support samples, similar to how the classification layer of a conventional neural network uses hard weights on the feature layer. More formally, this is given by
     α_qm(c) = Σ_{(x_si, y_si) ∈ S_c} [ b_φ(f_φ(x_si)) + f_φ(x_qm)^T w_φ(f_φ(x_si)) ]
     Conversely, the weights of the query samples are used to weight the features of the support samples to obtain the other similarity metric,
     β_qm(c) = Σ_{(x_si, y_si) ∈ S_c} [ b_φ(f_φ(x_qm)) + f_φ(x_si)^T w_φ(f_φ(x_qm)) ]
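     A minimal sketch of the two scores for a single query against one class's support set, assuming the learned embedding f, soft-weight head w, and bias head b are available as callables that return NumPy arrays and scalars; the names are illustrative, not the authors' code.

```python
import numpy as np

def swn_scores(query_x, support_xs, f, w, b):
    """Two-way cross comparison for one query against one class's support set.
    f, w, and b stand in for the learned embedding, soft-weight, and bias heads."""
    f_q = f(query_x)
    alpha = 0.0  # query features weighted by each support sample's soft weights
    beta = 0.0   # support features weighted by the query's soft weights
    for x_s in support_xs:
        f_s = f(x_s)
        alpha += b(f_s) + f_q @ w(f_s)
        beta += b(f_q) + f_s @ w(f_q)
    return alpha, beta
```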
 28. Learning a Soft Weight Network
     Optimization of the SWN: a distribution is established over the classes via a softmax function,
     p_φ(y = c | x_qm) = exp(α_qm(c) + β_qm(c)) / Σ_{c'=1}^C exp(α_qm(c') + β_qm(c'))
     During training, the soft weight network randomly selects a subset of classes from the training data and learns the network parameters φ via stochastic gradient descent by minimizing the average negative log probability,
     J(φ) = −(1/(CM)) Σ_{c'=1}^C Σ_{q=1}^M log p_φ(y = c' | x_qm)
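     A small sketch of these last two steps, assuming the per-class scores α_qm(c) + β_qm(c) have already been computed for every query in the episode: it forms the softmax posterior and averages the negative log-probability of the correct classes. Illustrative only; names and shapes are assumptions.

```python
import numpy as np

def posterior(scores):
    # p_phi(y = c | x_qm): softmax over the combined scores alpha_qm(c) + beta_qm(c)
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()

def episode_loss(score_matrix, true_classes):
    """score_matrix[m, c] = alpha_qm(c) + beta_qm(c) for query m and class c;
    returns the average negative log-probability of each query's correct class."""
    losses = [-np.log(posterior(score_matrix[m])[c])
              for m, c in enumerate(true_classes)]
    return float(np.mean(losses))
```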
 29. Example: Two-Way Comparison
     Consider the following two classes.
 30. Example: Two-Way Comparison
     Prototypical Networks vs. Soft Weight Networks: the two-way comparison has been shown empirically to perform better.
 31. Performance Comparison: Omniglot
     Performance for Soft Weight Networks is uniformly higher on the Omniglot data set than previous algorithms.
     Model                     1-Shot 5-Way   1-Shot 20-Way   5-Shot 5-Way   5-Shot 20-Way
     Siamese*                  98.8%          95.5%           –              –
     Matching Networks*        98.1%          93.8%           98.9%          98.5%
     Prototypical Networks*    98.8%          96.0%           99.7%          98.9%
     MAML*                     98.7%          95.8%           99.9%          98.9%
     ConvNet w/Memory*         98.4%          95.0%           99.6%          98.6%
     mAP-SSVM*                 98.6%          95.2%           99.6%          98.6%
     mAP-DLM*                  98.8%          95.4%           99.6%          98.6%
     Soft Weight Networks      99.7%          98.3%           99.9%          99.6%
     * Results reported by the authors.
 32. Performance Comparison: MiniImageNet
     The current best published performance is held by Prototypical Networks. Soft Weight Networks uses a single trained network (a 1-shot network) and demonstrates better general performance than Prototypical Networks across 1-shot and 5-shot testing.
 33. Current/Future Work
     Feature Selection: recall that Soft Weight Networks provide a soft weight for each embedded feature. What if we use the original features instead of the embedded features? Classification performance degrades, but the average weighting of the original features can be used to obtain the average "usefulness" of each feature for few-shot classification. This "usefulness" can be used as a metric for feature selection.
     Online Networks: since the network and class have been decoupled with few-shot neural networks, an obvious extension is to apply them to online classification.
     [Diagram: a data stream feeds online class identification with a trained few-shot neural network; novel classes are parsed as new classes and used to update the network.]
 34. Thank you!