Consider a two-layer neural network with: input: 40,000 dimensions (an input image is 200 x 200 pixels); hidden layer: 20,000 dimensions; output: 1,000 (1,000 categories for objects). The number of parameters is huge, ca. 0.82 billion (1.6 GB with float16): 1st layer: 40,000 x 20,000 = 800,000,000; 2nd layer: 20,000 x 1,000 = 20,000,000. The number of parameters depends on the size of the input images. Moreover, this treatment ignores stationarity in images: patterns appearing at different positions, and positional shifts of the input image.
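As a sanity check on these counts, here is a minimal Python sketch of the same arithmetic (the layer sizes are those from the example above):

```python
# Parameter count of the two-layer fully-connected network above
# (bias terms are ignored, as in the slide's arithmetic).
layers = [(40_000, 20_000),   # input -> hidden: 800,000,000 weights
          (20_000, 1_000)]    # hidden -> output: 20,000,000 weights
n_params = sum(d_in * d_out for d_in, d_out in layers)
print(f"{n_params:,} parameters")                  # 820,000,000
print(f"{n_params * 2 / 1e9:.2f} GB in float16")   # 2 bytes each -> ~1.64 GB
```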
A convolution applies a filter to local blocks of the image: the filter parameters are shared/reused across the different blocks, and they are acquired from the supervision data. This uses far fewer parameters than a fully-connected layer: 1,000 filters on a 10 x 10 window require only 100,000 parameters (much smaller than a fully-connected layer, e.g., 800,000,000 parameters).
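The saving is easy to verify in PyTorch; a minimal sketch, assuming a single-channel input and no bias term so that the count matches the arithmetic above:

```python
import torch.nn as nn

# 1,000 filters, each sliding a 10 x 10 window over the image;
# the same weights are reused at every position of the input.
conv = nn.Conv2d(in_channels=1, out_channels=1000, kernel_size=10, bias=False)
print(sum(p.numel() for p in conv.parameters()))   # 100,000 parameters
```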
Suppose we use convolution filters to detect a nose, an eye, and a mouth. However, we cannot presume where these objects are located in an image. How can we incorporate invariance to different positions? [Figure: an input image with the responses of the nose, mouth, and eye filters]
Pooling (applied to a feature map): discard exact positions, and focus on rough positions. A popular method is max pooling (taking the max within each partition). [Figure: pooled responses of the nose, mouth, and eye filters on the input image]
Example: max pooling with stride 2 x 2.

Input feature map (4 x 4):
2 1 0 0
3 2 1 1
4 3 2 1
2 2 1 0

Output (2 x 2), the max within each partition:
3 1
4 2

Other pooling operations (e.g., average pooling, ℓ2-norm pooling) are also used, but they are less popular because max pooling usually performs better.
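The numerical example above can be reproduced with PyTorch's built-in pooling function:

```python
import torch
import torch.nn.functional as F

# The 4 x 4 feature map from the example (shape: batch, channel, H, W).
x = torch.tensor([[2., 1., 0., 0.],
                  [3., 2., 1., 1.],
                  [4., 3., 2., 1.],
                  [2., 2., 1., 0.]]).reshape(1, 1, 4, 4)

# Max pooling with a 2 x 2 window and stride 2: the max of each partition.
print(F.max_pool2d(x, kernel_size=2, stride=2).reshape(2, 2))
# tensor([[3., 1.],
#         [4., 2.]])
```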
How can we combine the outputs from multiple filters? We apply convolutions to the results of the filters, expecting that the filtering results are integrated in the upper layer. [Figure: input image, first-layer filters, second-layer filters]
Convolutional Neural Network: a stack of convolution, non-linear, and pooling layers, i.e., a convolution layer, a non-linear transformation (e.g., ReLU), and a pooling layer (e.g., max pooling), repeated over multiple layers from input to output and followed by fully-connected layer(s) to make predictions (classification). Parameters are trained by backpropagation (in an end-to-end fashion), as sketched below.
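A minimal PyTorch sketch of this pipeline; the channel sizes, the 32 x 32 input, and the 10-class output are illustrative choices, not values from the slides:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolution layer
    nn.ReLU(),                                    # non-linear transformation
    nn.MaxPool2d(2),                              # pooling layer (down-sample)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # second convolution block
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # fully-connected prediction
)

x = torch.randn(1, 3, 32, 32)   # one 32 x 32 RGB image
print(model(x).shape)           # torch.Size([1, 10])
# All parameters are trained end-to-end by backpropagation,
# e.g., with torch.optim.SGD on a cross-entropy loss.
```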
ILSVRC: a challenge workshop for object detection and image classification that allows researchers to compare algorithms for these tasks. It was held from 2010 until 2017 and is based on the large-scale ImageNet dataset: ILSVRC uses a subset of ImageNet; for example, the training set of the classification task includes about 1.2M images associated with 1,000 categories. ILSVRC was a driving force for research on deep learning: Convolutional Neural Networks made a remarkable improvement in ILSVRC 2012, and several innovative methods appeared along with the challenges.
Classification task: algorithms produce a list of object categories present in the image (Russakovsky et al., 2015).

Single-object localization task: algorithms produce a list of object categories present in the image, along with an axis-aligned bounding box indicating the position and scale of one instance of each object category (Russakovsky et al., 2015).

Detection task: algorithms produce a list of object categories (out of 200 categories) present in the image, along with an axis-aligned bounding box indicating the position and scale of every instance of each object category (Russakovsky et al., 2015).

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211-252.
http://www.image-net.org/challenges/LSVRC/2013/slides/ILSVRC2013_12_7_13_clsloc.pdf
Categories of ImageNet are defined by WordNet, which provides a hierarchy between concepts (an ontology). http://www.image-net.org/papers/ImageNet_2010.pdf
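The hierarchy can be inspected directly; a small sketch using NLTK's WordNet interface (requires nltk and a one-time nltk.download('wordnet'); the synset 'dog.n.01' is an illustrative choice):

```python
from nltk.corpus import wordnet as wn

# Follow the hypernym links from the root of the ontology down to a concept.
path = wn.synset('dog.n.01').hypernym_paths()[0]
print(' -> '.join(s.name() for s in path))
# e.g., entity.n.01 -> physical_entity.n.01 -> ... -> canine.n.02 -> dog.n.01
```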
AlexNet: the winner of ILSVRC 2012; the error rate was drastically reduced (from 25.77% to 16.42%). It consists of 5 convolution layers and 3 fully-connected layers, and the architecture used cutting-edge methods (e.g., ReLU, dropout). It was designed to run on two GPUs (to fit the model into the small GPU memory available at the time). Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proc. of NIPS, pp. 1097-1105.
AlexNet in torchvision: the number of channels at each layer is different from that described in the original paper, because this implementation is based on an old one in torch7 that fit into a single GPU. See: https://github.com/pytorch/vision/pull/463 [Figure: the implemented architecture, ending with 3 fully-connected layers]
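For reference, this implementation can be loaded directly from torchvision; a minimal usage sketch (the weights argument assumes torchvision >= 0.13; older versions use pretrained=True instead):

```python
import torch
from torchvision.models import alexnet

model = alexnet(weights='IMAGENET1K_V1')   # downloads the pretrained weights
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))   # one 224 x 224 RGB image
print(logits.shape)                               # torch.Size([1, 1000])
```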
Visualizing and understanding a CNN: for each layer and convolution filter, find the top-9 highest outputs and reconstruct the original input using a 'deconvnet', an inverse transformation that maps the outputs back to the input space. It is impossible to reconstruct the original image completely, but the pixels contributing to the high outputs are highlighted. The visualization also shows the original image patch. Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Proc. of ECCV, pp. 818-833.
Observations: there is a strong grouping within each filter (feature map); lower layers tend to focus on primitive shapes and patterns, whereas higher layers seem to recognize the objects to be classified; and discriminative parts of the image are exaggerated, e.g., the eyes and noses of dogs (layer 4, row 1, col 1, next page) and the grass in the background rather than the foreground objects (layer 5, row 1, col 2).
The idea of extracting local features in a hierarchical network was proposed in 1982 by Kunihiko Fukushima as the Neocognitron (Fukushima and Miyake, 1982). Kunihiko Fukushima and Sei Miyake. 1982. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15(6):455-459.
LeNet: Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278-2324. The first architecture that is very close to recent CNNs, proposed for handwritten character recognition; the model is trained by backpropagation. Some differences from recent CNNs: a sigmoid activation function (instead of ReLU) and subsampling pooling (instead of max pooling).
VGGNet: Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proc. of ICLR. A simple and popular CNN architecture that explores deeper CNNs. It mostly uses filters with a small receptive field, 3 x 3 (the smallest size that can capture the notion of left/right, up/down, and center). Max pooling is performed over a 2 x 2 pixel window (i.e., down-sampling to half the resolution), and the number of channels is increased by a factor of 2 after each pooling layer (see the sketch below). VGGNet ranked second in ILSVRC 2014.
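A sketch of this repeating pattern (the block helper and the stage configuration are illustrative choices, not the full VGG-16 definition):

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    # n_convs 3 x 3 convolutions (padding 1 keeps the spatial size), each
    # followed by ReLU; then 2 x 2 max pooling halves the resolution.
    layers = []
    for _ in range(n_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU()]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# The number of channels doubles after each pooling stage.
features = nn.Sequential(
    vgg_block(3, 64, 2),
    vgg_block(64, 128, 2),
    vgg_block(128, 256, 3),
)
```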
ResNet: an even deeper architecture (152 layers). Deeper networks are difficult to train because of the vanishing-gradient problem, so ResNet proposed a residual learning framework to ease the training of deep networks. Residual connection: suppose that we want to learn a function h(x). We consider another mapping f(x) = h(x) - x; then the original mapping is h(x) = f(x) + x. We can view f(x) + x as a feedforward neural network with shortcut connections, and training f(x) is easier than training h(x). ResNet also uses batch normalization. It was the winner of ILSVRC 2015. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proc. of CVPR.
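A simplified sketch of a residual block with batch normalization (real ResNet blocks also handle strides and channel changes in the shortcut connection):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    # The stacked layers learn f(x); the shortcut adds x back, giving f(x) + x.
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),   # batch normalization
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)   # shortcut connection: f(x) + x
```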
Summary: a CNN is a stack of convolution, non-linear, and pooling layers. A convolution layer applies a filter to the input (a resemblance to image filters); a non-linear transformation (e.g., ReLU) follows; and a pooling layer down-samples the outputs (e.g., max pooling). After a stack of convolutions, fully-connected layers make predictions. Parameters (e.g., filter weights) are trained by backpropagation. A lot of innovative ideas improved the performance of image classification, in addition to advances in computation power and big data.