
Feedforward Neural Network (II): Multi-class Classification

multi-class classification, linear multi-class classifier, softmax function, Stochastic Gradient Descent (SGD), mini-batch training, loss functions, activation functions, dropout

Naoaki Okazaki

July 28, 2020

Transcript

  1. Feedforward Neural Network (II): Multi-class Classification Naoaki Okazaki School of

    Computing, Tokyo Institute of Technology [email protected] PowerPoint template designed by https://ppt.design4u.jp/template/
  2. Highlights of this lecture
    • We extend binary classification to multi-class classification
    • Assign a weight vector for every category
    • Extend the Perceptron algorithm to multi-class classification
    • Extend the sigmoid function to the softmax function
    • Again, automatic differentiation is useful for SGD training
    • ReLU is a popular activation function for internal layers
    • Dropout realizes model ensembling and averaging in a simple way
  3. MNIST database (LeCun+ 1998)
    Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324.
    [Figure: sample MNIST digit images labeled 4, 1, 0, 5, 6, 2, 8, 5]
    We want to classify an input image into 10 categories (digits).
  4. Representing an image on a computer
    • An image (28 × 28 pixels, grayscale) is represented by a 28 × 28 matrix.
    • The original dataset represents brightness as an 8-bit integer in [0, 255].
    • In this lecture, brightness is normalized to the range [0, 1].
  5. Representing an image with a vector
    • We convert an image into a vector where each element represents the brightness of a pixel, flattening the 2D matrix into a 1D vector
    • Pixel at (i, j) maps to feature ID (row major): 28(i − 1) + j
    • A 28 × 28 matrix is converted into a vector of 784 (= 28 × 28) dimensions
    • A more sophisticated method (e.g., Convolutional Neural Network) will be explained later
    • Even this simple treatment works surprisingly well
  6. Linear multi-class classification
    A model has a weight vector w_y for every category y. Given the image pixels x, compute the score (inner product) w_y ⋅ x for each category, and choose the category with the maximum score:
    w_0 ⋅ x = −1.24, w_1 ⋅ x = −4.30, w_2 ⋅ x = −0.68, w_3 ⋅ x = +3.62, w_4 ⋅ x = −5.61, w_5 ⋅ x = −1.94, w_6 ⋅ x = −5.56, w_7 ⋅ x = −6.86, w_8 ⋅ x = −0.08, w_9 ⋅ x = −3.69
    ŷ = 3 (the category with the maximum score)
  7. General form: linear multi-class classification
    ŷ = argmax_{y ∈ Y} w_y ⋅ x
    • Input: x ∈ ℝ^d (d: number of dimensions)
    • Output: ŷ ∈ Y (Y: set of possible categories for the input)
    • Parameter: weight w_y ∈ ℝ^d (prepared for every category y ∈ Y)
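As a minimal sketch of this prediction rule (in numpy; the weight values and input below are random placeholders, not a trained model):

```python
import numpy as np

# Toy setting: d = 784 pixels, 10 digit categories (values are illustrative).
d, num_categories = 784, 10
rng = np.random.default_rng(0)
W = rng.normal(size=(num_categories, d))   # one weight vector w_y per category
x = rng.random(d)                          # flattened, normalized image

scores = W @ x                             # w_y . x for every category y
y_hat = int(np.argmax(scores))             # category with the maximum score
print(y_hat, scores[y_hat])
```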
  8. Supervised learning (training) for a multi-class classifier
    • We have supervision data (x: input, y: output): D = {(x_1, y_1), …, (x_N, y_N)} (N instances)
    • Find the weight vectors such that they predict the training instances as correctly as possible: ∀n ∈ {1, …, N}: ŷ_n = argmax_{y ∈ Y} w_y ⋅ x_n = y_n
    • We assume generalization: if the parameters reproduce the training instances well, they will work for unseen instances
    (x_n: the n-th instance in the training data; y_n: the category for the n-th instance)
  9. Perceptron algorithm (Collins, 2002)
    1. w_y = 0 for all y ∈ Y
    2. Repeat:
    3.   (x, y) ⟵ an instance chosen from D at random
    4.   ŷ ⟵ argmax_{y′ ∈ Y} w_{y′} ⋅ x
    5.   if ŷ ≠ y then: (incorrect prediction)
    6.     w_y ⟵ w_y + x (w_y ⋅ x will be larger)
    7.     w_ŷ ⟵ w_ŷ − x (w_ŷ ⋅ x will be smaller)
    8. Until no instance updates the weights
    Michael Collins. 2002. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proc. of EMNLP, 1-8.
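A compact numpy sketch of the algorithm above. The function name `train_perceptron` and the fixed number of epochs are illustrative simplifications; the slide's loop runs until no instance triggers an update.

```python
import numpy as np

def train_perceptron(X, Y, num_categories, epochs=10, seed=0):
    """Multi-class perceptron (sketch of the algorithm on the slide).

    X: (N, d) array of inputs, Y: (N,) array of integer category labels.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = np.zeros((num_categories, d))          # 1. w_y = 0 for all y
    for _ in range(epochs):                    # 2. repeat (fixed epochs here)
        for n in rng.permutation(N):           # 3. pick instances at random
            x, y = X[n], Y[n]
            y_hat = int(np.argmax(W @ x))      # 4. current prediction
            if y_hat != y:                     # 5. incorrect prediction
                W[y] += x                      # 6. make w_y . x larger
                W[y_hat] -= x                  # 7. make w_yhat . x smaller
    return W
```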
  10. Intuitive example of Perceptron updates
    Given an image x of the digit 3 (y = 3), compute the score (inner product) for each category:
    w_0 ⋅ x = −1.24, w_1 ⋅ x = −4.30, w_2 ⋅ x = −0.68, w_3 ⋅ x = +3.62, w_4 ⋅ x = −5.61, w_5 ⋅ x = −1.94, w_6 ⋅ x = −5.56, w_7 ⋅ x = −6.86, w_8 ⋅ x = +3.87, w_9 ⋅ x = −3.69
    Update necessary! The model predicts ŷ = 8 but y = 3 in fact:
    w_3 ⟵ w_3 + x (w_3 ⋅ x will be larger after the update)
    w_8 ⟵ w_8 − x (w_8 ⋅ x will be smaller after the update)
  11. Summary
    • A linear multi-class classifier has a weight vector w_y for every category y ∈ Y
    • Given an input x, a linear multi-class classifier computes a score for every category y as an inner product w_y ⋅ x
    • It predicts the category ŷ yielding the highest score among the possible categories
    • Weight vectors can be trained by an extension of the Perceptron algorithm to multi-class (structured perceptron)
    • Again, we cannot use it for multi-layer neural networks
    • Let's consider SGD for training multi-class classifiers
  12. Training multi-class classifiers with SGD
    • In order to train binary classifiers using SGD, we had to change the activation function from the step function to the sigmoid function
    • What is the activation function for multi-class classification corresponding to the sigmoid function?
    • Answer: the softmax function
  13. Softmax function: Definition
    • Given a vector a ∈ ℝ^K, the softmax function σ: ℝ^K → ℝ^K yields
      σ(a)_k = exp(a_k) / Σ_{k′=1}^{K} exp(a_{k′})
    • Here σ(a)_k denotes the k-th element of the value of σ(a)
    • We use the same notation σ as for the sigmoid (do not confuse them)
    • A result of the softmax function satisfies: ∀k: σ(a)_k > 0 and Σ_{k=1}^{K} σ(a)_k = 1
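A small numpy sketch of this definition, applied to the scores from the earlier example. Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(a):
    """Softmax over a score vector a; the max-shift avoids overflow
    without changing the output."""
    z = np.exp(a - np.max(a))
    return z / z.sum()

# Scores from the linear multi-class classification example.
p = softmax(np.array([-1.24, -4.30, -0.68, 3.62, -5.61,
                      -1.94, -5.56, -6.86, -0.08, -3.69]))
print(p.round(4), p.sum())   # all elements > 0 and they sum to 1; p[3] ~ 0.951
```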
  14. Softmax function: Intuitive explanation
    • The softmax function converts scores for categories a ∈ ℝ^K into a probability distribution
    • This is similar to binary classification, where the sigmoid function converts a score into a probability
  15. Single-layer NNs for multi-class classification
    • Given an input x ∈ ℝ^d, a single-layer NN for multi-class classification yields a probability distribution over K categories ŷ ∈ ℝ^K: ŷ = σ(a), a = Wx
    • Here, W ∈ ℝ^{K×d} is a weight matrix
    • The NN can be seen as a mapping ℝ^d → ℝ^K
    • Let w_k denote the k-th row vector of the matrix W; the score for category k is a_k = w_k ⋅ x
    • This is the same as linear multi-class classification
  16. An example with softmax function
    Scores for the image pixels x and the resulting softmax probabilities:
    w_0 ⋅ x = −1.24 (0.74%), w_1 ⋅ x = −4.30 (0.03%), w_2 ⋅ x = −0.68 (1.29%), w_3 ⋅ x = +3.62 (95.1%), w_4 ⋅ x = −5.61 (0.01%), w_5 ⋅ x = −1.94 (0.37%), w_6 ⋅ x = −5.56 (0.01%), w_7 ⋅ x = −6.86 (0.00%), w_8 ⋅ x = −0.08 (2.35%), w_9 ⋅ x = −3.69 (0.06%)
    ŷ = σ(Wx)
  17. Supervision data for multi-class (with a notational change)
    • We have supervision data: D = {(x_1, t_1), …, (x_N, t_N)} (N instances)
    • Input: x_n = (x_{n1}, x_{n2}, …, x_{nd})^⊺ ∈ ℝ^d
    • Output (changed from the previous notation): t_n = (t_{n1}, t_{n2}, …, t_{nK})^⊺ ∈ ℝ^K (one-hot vector)
    • Example: y_n = 3 is represented as t_n = (0,0,0,1,0,0,0,0,0,0)^⊺
  18. Instance-wise likelihood
    • We introduce an instance-wise likelihood to measure how well the parameters reproduce (x_n, t_n):
      p_n = ∏_{k=1}^{K} { ŷ_{nk} (if t_{nk} = 1); 1 (if t_{nk} = 0) } = ∏_{k=1}^{K} ŷ_{nk}^{t_{nk}}
    • This is the probability of the true label estimated by the model
    • Example: with the probabilities from the previous example (95.1% for the digit 3) and t_n = (0,0,0,1,0,0,0,0,0,0)^⊺, we obtain p_n = 0.951
  19. Likelihood on the training data
    • We assume that all instances in the training data are i.i.d. (independent and identically distributed)
    • We define the likelihood as a joint probability on the data: L_D(W) = ∏_{n=1}^{N} p_n
    • When the training data D = {(x_1, t_1), …, (x_N, t_N)} is fixed, the likelihood is a function of the parameters W
    • Let us maximize L_D(W) by changing W
    • This is called Maximum Likelihood Estimation (MLE)
    • The maximizer W* reproduces the training data well
  20. Training as a minimization problem
    • Products of (0,1) values often cause underflow
    • Instead, use the log-likelihood, the logarithm of the likelihood: LL_D(W) = log L_D(W) = log ∏_{n=1}^{N} p_n = Σ_{n=1}^{N} log p_n
    • In mathematical optimization, we usually consider a minimization problem instead of a maximization problem
    • We define an objective function E(W) as the negative of the log-likelihood: E(W) = −LL_D(W) = −Σ_{n=1}^{N} log p_n
    • E(W) is called a loss function or error function
  21. Training as a minimization problem
    Given the training data D = {(x_1, t_1), …, (x_N, t_N)}, find W* by solving the minimization problem:
    W* = argmin_W E(W) = argmin_W Σ_{n=1}^{N} E_n(W)
    E_n(W) = −log p_n = −log ∏_{k=1}^{K} ŷ_{nk}^{t_{nk}} = −Σ_{k=1}^{K} t_{nk} log ŷ_{nk}
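As a quick numerical illustration of E_n = −Σ_k t_{nk} log ŷ_{nk}, using the probabilities shown earlier for the digit 3 (the 0.00% entry is replaced by a tiny positive value here only to avoid log(0)):

```python
import numpy as np

# Probabilities from the softmax example and the one-hot target for digit 3.
y_hat = np.array([0.0074, 0.0003, 0.0129, 0.951, 0.0001,
                  0.0037, 0.0001, 1e-8, 0.0235, 0.0006])
t = np.zeros(10)
t[3] = 1.0

E_n = -np.sum(t * np.log(y_hat))   # only the true category contributes
print(E_n)                         # = -log(0.951) ~ 0.050
```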
  22. Stochastic Gradient Descent (SGD)
    • The objective function is the sum of the losses of the instances: E(W) = Σ_{n=1}^{N} E_n(W)
    • We can use Stochastic Gradient Descent (SGD) and its variants (e.g., Adam) for minimizing E(W)
    • SGD algorithm (T is the number of updates):
    1. Initialize W with random values
    2. for t ⟵ 1 to T:
    3.   η_t ⟵ 1/t  # Learning rate at t
    4.   (x_n, t_n) ⟵ an instance chosen from D at random
    5.   W ⟵ W − η_t ∂E_n(W)/∂W
  23. Exercise: compute the gradient
    Prove (we omit the instance index n for simplicity):
    ∂E/∂w_k = (∂E/∂a_k)(∂a_k/∂w_k) = (ŷ_k − t_k) x
    by computing the gradients ∂E/∂a_k and ∂a_k/∂w_k. Here:
    E = −Σ_{k=1}^{K} t_k log ŷ_k,  ŷ_k = σ(a)_k = exp(a_k) / Σ_{k′=1}^{K} exp(a_{k′}),  a_k = w_k ⋅ x
  24. Answer: the gradient
    Because it is easy to find ∂a_k/∂w_k = x, we concentrate on ∂E/∂a_k:
    ∂E/∂a_k = ∂/∂a_k (−Σ_{j=1}^{K} t_j log ŷ_j) = −Σ_{j=1}^{K} (t_j/ŷ_j)(∂ŷ_j/∂a_k) = −(t_k/ŷ_k)(∂ŷ_k/∂a_k) − Σ_{j≠k} (t_j/ŷ_j)(∂ŷ_j/∂a_k)
    Writing Σ = Σ_{j=1}^{K} exp(a_j), the first term is:
    −(t_k/ŷ_k) ∂/∂a_k (exp(a_k)/Σ) = −(t_k/ŷ_k) (exp(a_k)Σ − exp(a_k)exp(a_k))/Σ² = −(t_k/ŷ_k) (exp(a_k)/Σ)(Σ − exp(a_k))/Σ = −(t_k/ŷ_k) ŷ_k (1 − ŷ_k) = −t_k(1 − ŷ_k) = −t_k + t_k ŷ_k
    The second term is:
    −Σ_{j≠k} (t_j/ŷ_j) ∂/∂a_k (exp(a_j)/Σ_{j′=1}^{K} exp(a_{j′})) = −Σ_{j≠k} (t_j/ŷ_j) (0 − exp(a_j)exp(a_k))/Σ² = Σ_{j≠k} (t_j/ŷ_j) ŷ_j ŷ_k = Σ_{j≠k} t_j ŷ_k
    Therefore:
    ∂E/∂a_k = −t_k + t_k ŷ_k + Σ_{j≠k} t_j ŷ_k = −t_k + ŷ_k Σ_{j=1}^{K} t_j = −t_k + ŷ_k = ŷ_k − t_k
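A short sketch that checks the derived gradient ∂E/∂a = ŷ − t against a finite-difference estimate (the vector size and random values are arbitrary):

```python
import numpy as np

def softmax(a):
    z = np.exp(a - np.max(a))
    return z / z.sum()

def loss(a, t):
    return -np.sum(t * np.log(softmax(a)))

rng = np.random.default_rng(0)
a = rng.normal(size=10)              # arbitrary score vector
t = np.zeros(10)
t[3] = 1.0                           # one-hot target

analytic = softmax(a) - t            # dE/da_k = y_hat_k - t_k
numeric = np.zeros(10)
eps = 1e-6
for k in range(10):
    a_plus, a_minus = a.copy(), a.copy()
    a_plus[k] += eps
    a_minus[k] -= eps
    numeric[k] = (loss(a_plus, t) - loss(a_minus, t)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # should be tiny (~1e-9)
```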
  25. SGD elaborated for training single-layer NNs
    1. For every k, initialize w_k with random values
    2. for t ⟵ 1 to T:
    3.   η_t ⟵ 1/t
    4.   (x_n, t_n) ⟵ an instance chosen from D at random
    5.   ŷ_n ⟵ σ(W x_n)
    6.   ∀k: w_k ⟵ w_k − η_t ∂E_n/∂w_k = w_k + η_t (t_{nk} − ŷ_{nk}) x_n
    • The algorithm is the same as that for binary classification
    • For each category k, it updates the weight w_k by the amount of the error (t_{nk} − ŷ_{nk}) between the true probability t_{nk} and the estimated probability ŷ_{nk}
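A numpy sketch of this training loop (the function name `sgd_train`, the number of updates, and the initialization scale are illustrative choices):

```python
import numpy as np

def softmax(a):
    z = np.exp(a - np.max(a))
    return z / z.sum()

def sgd_train(X, T, num_updates=10000, seed=0):
    """SGD for a single-layer softmax classifier (sketch of the slide's algorithm).

    X: (N, d) inputs, T: (N, K) one-hot targets.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    K = T.shape[1]
    W = rng.normal(scale=0.01, size=(K, d))       # 1. random initialization
    for t in range(1, num_updates + 1):
        eta = 1.0 / t                             # 3. learning rate at step t
        n = rng.integers(N)                       # 4. random instance
        y_hat = softmax(W @ X[n])                 # 5. forward pass
        W += eta * np.outer(T[n] - y_hat, X[n])   # 6. w_k += eta (t_k - y_hat_k) x
    return W
```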
  26. Intuitive example of SGD updates (y_n = 3, ŷ = 3; η_t = 1)
    Scores and softmax probabilities, followed by the update to each weight vector:
    w_0 ⋅ x = −1.24 (0.74%): w_0 −= 0.0074 x
    w_1 ⋅ x = −4.30 (0.03%): w_1 −= 0.0003 x
    w_2 ⋅ x = −0.68 (1.29%): w_2 −= 0.0129 x
    w_3 ⋅ x = +3.62 (95.1%): w_3 += 0.0490 x
    w_4 ⋅ x = −5.61 (0.01%): w_4 −= 0.0001 x
    w_5 ⋅ x = −1.94 (0.37%): w_5 −= 0.0037 x
    w_6 ⋅ x = −5.56 (0.01%): w_6 −= 0.0001 x
    w_7 ⋅ x = −6.86 (0.00%): w_7 −= 0.0000 x
    w_8 ⋅ x = −0.08 (2.35%): w_8 −= 0.0235 x
    w_9 ⋅ x = −3.69 (0.06%): w_9 −= 0.0006 x
  27. Computing the loss with mini-batch
    • Single instance: ŷ = σ(Wx) (a vector of K probabilities), E_n = −t_n ⋅ log ŷ_n
    • Mini-batch of M instances (parallelizable on CPU/GPU): Ŷ = σ(XW^⊺) (an M × K matrix with softmax applied to each row), E = −(1/M) Σ_{m=1}^{M} t_m ⋅ log ŷ_m
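A rough numpy sketch of the mini-batch version (the batch size, shapes, and random data are placeholders; softmax is applied row-wise):

```python
import numpy as np

def softmax_rows(A):
    """Row-wise softmax over a (M, K) score matrix."""
    Z = np.exp(A - A.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

# Hypothetical mini-batch: M instances, d input dimensions, K categories.
M, d, K = 256, 784, 10
rng = np.random.default_rng(0)
X = rng.random((M, d))                    # (M, d) batched inputs
T = np.eye(K)[rng.integers(K, size=M)]    # (M, K) one-hot targets
W = rng.normal(scale=0.01, size=(K, d))

Y_hat = softmax_rows(X @ W.T)             # (M, K), one distribution per instance
E = -np.mean(np.sum(T * np.log(Y_hat), axis=1))   # averaged cross-entropy loss
```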
  28. Mini-batch training
    • Most DL frameworks implement mini-batch training by increasing the order of the tensors: for example, x ∈ ℝ^d → X ∈ ℝ^{M×d}
    • Increasing the batch size (M) may:
      • Reduce the time required for an epoch through parallelization
      • Decrease the number of parameter updates (to 1/M)
    • This paper (Goyal+ 2017) recommends: when the minibatch size is multiplied by k, multiply the learning rate by k
    Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677.
  29. Mini-batch SGD implemented in pytorch
    • xt: torch.tensor of shape (M × 784); yt: torch.tensor of shape (M)
    • Define a NN model as a sequence of modules
    • Sample instances with a batch size of 256
    https://github.com/chokkan/deeplearning/blob/master/notebook/mnist.ipynb
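The actual code lives in the linked notebook; the sketch below only approximates the same setup (a single linear layer trained with softmax cross-entropy and batch size 256) with randomly generated stand-in data, so the tensor values and hyperparameters here are assumptions rather than the notebook's:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data for MNIST: N flattened 28x28 images and integer labels.
N = 1000
xt = torch.rand(N, 784)                     # (N, 784) inputs in [0, 1]
yt = torch.randint(0, 10, (N,))             # (N,) integer category labels

model = nn.Sequential(nn.Linear(784, 10))   # single-layer NN producing scores
loss_fn = nn.CrossEntropyLoss()             # softmax + cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(TensorDataset(xt, yt), batch_size=256, shuffle=True)

for epoch in range(10):
    for xb, yb in loader:                   # mini-batches of 256 instances
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)       # mini-batch loss
        loss.backward()                     # automatic differentiation
        optimizer.step()                    # parameter update
```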
  30. Regularization
    • MLE often causes over-fitting
      • When the training data is linearly separable, ‖W‖ → ∞ as the loss Σ_{n=1}^{N} E_n(W) → 0
      • The model is easily affected by noise in the training data
    • We use regularization (MAP estimation)
      • We introduce a penalty term when ‖W‖ becomes large
    • The loss function with an L2 regularization term: Ê(W) = −Σ_{n=1}^{N} log p_n + λ‖W‖²
    • λ is a hyperparameter that controls the trade-off between over- and under-fitting
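A hedged PyTorch sketch of L2 regularization (the λ value of 1e-4 is illustrative). The optimizer's weight_decay option adds the penalty through its gradient (λ·w per parameter), which is equivalent to penalizing the squared L2 norm of the weights; the explicit penalty term is shown as an alternative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 10))

# Option 1: weight decay in the optimizer applies an L2 penalty via its gradient.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# Option 2: add the penalty term to the loss explicitly before backward().
def l2_penalty(model, lam=1e-4):
    """lam * sum of squared weights over all parameters."""
    return lam * sum((p ** 2).sum() for p in model.parameters())
```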
  31. Summary and notes
    • K-class classification is realized by an output layer with dimension K
    • Softmax yields a probability distribution ŷ ∈ ℝ^K
    • The loss function compares the model output ŷ with the true category, represented as a one-hot vector t
    • Again, automatic differentiation is useful for training multi-class NNs
    • A single-layer NN with the softmax activation function is also known as multi-class logistic regression and maximum entropy modeling
  32. Generic notation for multi-layer NNs
    • Configurations: the number of layers, the numbers of dimensions of the hidden layers, an activation function for each layer, and a loss function
    • Example network (input ℝ², output ℝ²):
      • First layer: ℝ² → ℝ³, h^(1) = g^(1)(a^(1)), a^(1) = W^(1) h^(0) + b^(1), with W^(1) ∈ ℝ^{3×2} and a^(1), b^(1) ∈ ℝ³
      • Second layer: ℝ³ → ℝ², h^(2) = g^(2)(a^(2)), a^(2) = W^(2) h^(1) + b^(2), with W^(2) ∈ ℝ^{2×3} and a^(2), b^(2) ∈ ℝ²
      • Final layer: ℝ² → ℝ², h^(3) = g^(3)(a^(3)), a^(3) = W^(3) h^(2) + b^(3), with W^(3) ∈ ℝ^{2×2} and a^(3), b^(3) ∈ ℝ²
    • The input is h^(0) = (x_1, x_2)^⊺ and the output is (h_1^(3), h_2^(3)) = (ŷ_1, ŷ_2)
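A PyTorch sketch of the example network above; the slide leaves the activation functions g^(l) generic, so the sigmoid/softmax choices here are assumptions:

```python
import torch.nn as nn

# 2 -> 3 -> 2 -> 2 network with assumed activations (sigmoid hidden layers,
# softmax output); nn.Linear includes both the weight matrix and the bias.
model = nn.Sequential(
    nn.Linear(2, 3), nn.Sigmoid(),         # first layer:  R^2 -> R^3
    nn.Linear(3, 2), nn.Sigmoid(),         # second layer: R^3 -> R^2
    nn.Linear(2, 2), nn.Softmax(dim=-1),   # final layer:  R^2 -> R^2
)
```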
  33. Cross entropy loss (binary)
    • Cross entropy between the true probability distribution t (1 for the true category; 0 otherwise) and the predicted probability distribution y: H(t, y) = −Σ_x t(x) log y(x)
    • For binary classification: E(t, ŷ) = −{ log ŷ (if t = 1); log(1 − ŷ) (if t = 0) } = −t log ŷ − (1 − t) log(1 − ŷ)
  34. Cross entropy loss (multi)
    • Cross entropy between the true probability distribution t (one-hot) and the predicted probability distribution ŷ: H(t, ŷ) = −Σ_k t_k log ŷ_k
    • For multi-class classification with scores a and true category y: E(t, ŷ) = −log (exp(a_y) / Σ_k exp(a_k)) = −a_y + log Σ_k exp(a_k)
    • This is the negative log of the probability of the true label estimated by the model (E_n = −log p_n)
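A small numpy sketch of the score-based form E = −a_y + log Σ_k exp(a_k), using the max-shift trick to keep the log-sum-exp stable (the scores are the ones from the earlier example):

```python
import numpy as np

def cross_entropy_from_scores(a, y):
    """E = -a_y + log(sum_k exp(a_k)); the max is factored out so the
    log-sum-exp does not overflow."""
    m = np.max(a)
    return -a[y] + m + np.log(np.sum(np.exp(a - m)))

a = np.array([-1.24, -4.30, -0.68, 3.62, -5.61,
              -1.94, -5.56, -6.86, -0.08, -3.69])
print(cross_entropy_from_scores(a, 3))    # ~ -log(0.951) ~ 0.05
```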
  35. Step
    Step function: ℝ → {0,1}, g(a) = 1 (if a > 0), 0 (otherwise)
    • Pros: yields a binary output
    • Cons (never use this): zero gradients; SGD cannot update the parameters because the gradient is 0 everywhere
  36. Sigmoid
    Sigmoid: ℝ → (0,1), σ(a) = 1 / (1 + e^{−a})
    • Pros: yields an output within (0,1)
    • Cons: not zero-centered; zero (vanishing) gradients when |a| is large
  37. Hyperbolic tangent (tanh)
    tanh: ℝ → (−1,1), tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a}) = 2σ(2a) − 1
    • Pros: yields an output within (−1,1); zero-centered
    • Cons: zero (vanishing) gradients when |a| is large
  38. Rectified Linear Unit (ReLU)
    ReLU: ℝ → ℝ≥0, ReLU(a) = max(0, a)
    • Pros: gradients do not vanish when a > 0; light-weight computation (no exponential); faster convergence (e.g., 6x faster on CIFAR-10)
    • Cons: not zero-centered; dead neurons when a ≤ 0
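Plain numpy definitions of the four activation functions just discussed (the sample input range is arbitrary):

```python
import numpy as np

def step(a): return (a > 0).astype(float)       # zero gradient everywhere
def sigmoid(a): return 1.0 / (1.0 + np.exp(-a)) # output in (0, 1)
def tanh(a): return np.tanh(a)                  # output in (-1, 1), zero-centered
def relu(a): return np.maximum(0.0, a)          # gradient 1 for a > 0, 0 otherwise

a = np.linspace(-5, 5, 11)
for f in (step, sigmoid, tanh, relu):
    print(f.__name__, f(a).round(2))
```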
  39. Dropout (Srivastava+ 2014)
    • A simple method for preventing overfitting
    • Randomly drops units from a NN during training
    • Virtually samples an exponential number of different 'thinned' NNs during training
    • Prevents units from co-adapting too much
    • At inference (test) time, approximates the effect of averaging the thinned NNs, simply by using the entire NN with smaller weights
    • Improves the performance on test data
    Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(Jun):1929−1958.
  40. Dropout at training phase
    • For each training instance, choose units at random and drop them
    • Virtually samples an exponential number of different 'thinned' NNs
    • Train the thinned NNs by the same algorithm as standard NNs
    (Srivastava+ 2014)
  41. Dropout at training phase
    • At training time, retain each unit with probability p
    • At inference (test) time, multiply the trained weights by p
    • This approximates the effect of averaging the predictions from exponentially many thinned models
    (Srivastava+ 2014)
  42. Dropout at training phase
    ỹ = r ⊙ y,  r_j ∼ Bernoulli(p)
    (r is a vector of independent Bernoulli variables, each equal to 1 with probability p; it masks the layer output y element-wise)
    (Srivastava+ 2014)
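A numpy sketch of this training-time masking and the test-time weight scaling described on the previous slides (the function names are illustrative; this follows the paper's formulation, where p is the probability of keeping a unit):

```python
import numpy as np

def dropout_train(h, p, rng):
    """Training-time dropout as on the slide: each unit is kept with
    probability p (r_j ~ Bernoulli(p)) and the rest are zeroed."""
    r = rng.binomial(1, p, size=h.shape)
    return r * h

def dropout_test(h, p):
    """Test-time approximation: use all units and scale the activations by p."""
    return p * h

rng = np.random.default_rng(0)
h = rng.random(8)
print(dropout_train(h, 0.5, rng))
print(dropout_test(h, 0.5))
```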
  43. Dropout in pytorch
    • Simply add a Dropout module
    • Do not forget to switch between training and test modes:
      • model.train(): units are kept alive with probability p
      • model.eval(): weights are multiplied by p
    • We can control the dropout rate by specifying it as an argument (p = 0.5 by default)
    https://github.com/chokkan/deeplearning/blob/master/notebook/mnist.ipynb
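A hedged PyTorch sketch of adding a Dropout module and switching modes (the layer sizes are placeholders). Note that nn.Dropout(p) takes the probability of dropping a unit and implements inverted dropout: activations are scaled by 1/(1−p) during training and left unchanged in eval mode, which approximates the same averaging effect as the description above.

```python
import torch
import torch.nn as nn

# Hypothetical model with a Dropout module between two layers.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Dropout(p=0.5),           # p here is the drop probability
    nn.Linear(256, 10),
)

x = torch.rand(32, 784)
model.train()   # dropout active: random units are zeroed on each forward pass
_ = model(x)
model.eval()    # dropout disabled: deterministic forward pass
_ = model(x)
```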
  44. Summary
    • We extend binary classification to multi-class classification
    • Assign a weight vector for every category
    • Extend the Perceptron algorithm to multi-class classification
    • Extend the sigmoid function to the softmax function
    • Again, automatic differentiation is useful for SGD training
    • ReLU is a popular activation function for internal layers
    • Dropout realizes model ensembling and averaging in a simple way