Consider a two-layer neural network with: input: 40,000 dimensions (an input image is 200 x 200 pixels); hidden layer: 20,000 dimensions; output: 1,000 (1,000 categories for objects). The number of parameters is huge, ca. 0.82 billion (1.6 GB with float16): 1st layer: 40,000 x 20,000 = 800,000,000; 2nd layer: 20,000 x 1,000 = 20,000,000. The number of parameters depends on the size of the input images. Moreover, this treatment ignores stationarity in images: patterns appearing at different positions, and positional shifts of the input image.
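As a sanity check on these counts, here is a minimal Python sketch of the same arithmetic (the layer sizes are those from the example above):

```python
# Parameter count of the two-layer fully-connected network above
# (bias terms are ignored, as in the slide's arithmetic).
layers = [(40_000, 20_000),   # input -> hidden: 800,000,000 weights
          (20_000, 1_000)]    # hidden -> output: 20,000,000 weights
n_params = sum(d_in * d_out for d_in, d_out in layers)
print(f"{n_params:,} parameters")                  # 820,000,000
print(f"{n_params * 2 / 1e9:.2f} GB in float16")   # 2 bytes each -> ~1.64 GB
```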
A convolution applies a filter to local blocks of the image: the filter parameters are shared/reused across the different blocks, and they are acquired from the supervision data. This uses far fewer parameters than a fully-connected layer: 1,000 filters on a 10 x 10 window require only 100,000 parameters (much smaller than a fully-connected layer, e.g., 800,000,000 parameters).
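The saving is easy to verify in PyTorch; a minimal sketch, assuming a single-channel input and no bias term so that the count matches the arithmetic above:

```python
import torch.nn as nn

# 1,000 filters, each sliding a 10 x 10 window over the image;
# the same weights are reused at every position of the input.
conv = nn.Conv2d(in_channels=1, out_channels=1000, kernel_size=10, bias=False)
print(sum(p.numel() for p in conv.parameters()))   # 100,000 parameters
```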
Suppose we use convolution filters to detect a nose, an eye, and a mouth. However, we cannot presume where these objects are located in an image. How can we incorporate invariance to different positions? [Figure: an input image with the responses of the nose, mouth, and eye filters]
Pooling (applied to a feature map): discard exact positions, and focus on rough positions. A popular method is max pooling (taking the max within each partition). [Figure: pooled responses of the nose, mouth, and eye filters on the input image]
Example: max pooling with stride 2 x 2.

Input feature map (4 x 4):
2 1 0 0
3 2 1 1
4 3 2 1
2 2 1 0

Output (2 x 2), the max within each partition:
3 1
4 2

Other pooling operations (e.g., average pooling, ℓ2-norm pooling) are also used, but they are less popular because max pooling usually performs better.
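The numerical example above can be reproduced with PyTorch's built-in pooling function:

```python
import torch
import torch.nn.functional as F

# The 4 x 4 feature map from the example (shape: batch, channel, H, W).
x = torch.tensor([[2., 1., 0., 0.],
                  [3., 2., 1., 1.],
                  [4., 3., 2., 1.],
                  [2., 2., 1., 0.]]).reshape(1, 1, 4, 4)

# Max pooling with a 2 x 2 window and stride 2: the max of each partition.
print(F.max_pool2d(x, kernel_size=2, stride=2).reshape(2, 2))
# tensor([[3., 1.],
#         [4., 2.]])
```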
How can we combine the outputs from multiple filters? We apply convolutions to the results of the filters, expecting that the filtering results are integrated in the upper layer. [Figure: input image, first-layer filters, second-layer filters]
Convolutional Neural Network: a stack of convolution, non-linear, and pooling layers, i.e., a convolution layer, a non-linear transformation (e.g., ReLU), and a pooling layer (e.g., max pooling), repeated over multiple layers from input to output and followed by fully-connected layer(s) to make predictions (classification). Parameters are trained by backpropagation (in an end-to-end fashion), as sketched below.
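A minimal PyTorch sketch of this pipeline; the channel sizes, the 32 x 32 input, and the 10-class output are illustrative choices, not values from the slides:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolution layer
    nn.ReLU(),                                    # non-linear transformation
    nn.MaxPool2d(2),                              # pooling layer (down-sample)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # second convolution block
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # fully-connected prediction
)

x = torch.randn(1, 3, 32, 32)   # one 32 x 32 RGB image
print(model(x).shape)           # torch.Size([1, 10])
# All parameters are trained end-to-end by backpropagation,
# e.g., with torch.optim.SGD on a cross-entropy loss.
```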
ILSVRC: a challenge workshop for object detection and image classification that allows researchers to compare algorithms for these tasks. It was held from 2010 until 2017 and is based on the large-scale ImageNet dataset: ILSVRC uses a subset of ImageNet; for example, the training set of the classification task includes about 1.2M images associated with 1,000 categories. ILSVRC was a driving force for research on deep learning: Convolutional Neural Networks made a remarkable improvement in ILSVRC 2012, and several innovative methods appeared along with the challenges.
Classification task: algorithms produce a list of object categories present in the image (Russakovsky et al., 2015).

Single-object localization task: algorithms produce a list of object categories present in the image, along with an axis-aligned bounding box indicating the position and scale of one instance of each object category (Russakovsky et al., 2015).

Detection task: algorithms produce a list of object categories (out of 200 categories) present in the image, along with an axis-aligned bounding box indicating the position and scale of every instance of each object category (Russakovsky et al., 2015).

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211-252.
http://www.image-net.org/challenges/LSVRC/2013/slides/ILSVRC2013_12_7_13_clsloc.pdf
Categories of ImageNet are defined by WordNet, which provides a hierarchy between concepts (an ontology). http://www.image-net.org/papers/ImageNet_2010.pdf
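The hierarchy can be inspected directly; a small sketch using NLTK's WordNet interface (requires nltk and a one-time nltk.download('wordnet'); the synset 'dog.n.01' is an illustrative choice):

```python
from nltk.corpus import wordnet as wn

# Follow the hypernym links from the root of the ontology down to a concept.
path = wn.synset('dog.n.01').hypernym_paths()[0]
print(' -> '.join(s.name() for s in path))
# e.g., entity.n.01 -> physical_entity.n.01 -> ... -> canine.n.02 -> dog.n.01
```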
AlexNet: the winner of ILSVRC 2012; the error rate was drastically reduced (from 25.77% to 16.42%). It consists of 5 convolution layers and 3 fully-connected layers, and the architecture used cutting-edge methods (e.g., ReLU, dropout). It was designed to run on two GPUs (to fit the model into the small GPU memory available at the time). Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proc. of NIPS, pp. 1097-1105.
AlexNet in torchvision: the number of channels at each layer is different from that described in the original paper, because this implementation is based on an old one in torch7 that fit into a single GPU. See: https://github.com/pytorch/vision/pull/463 [Figure: the implemented architecture, ending with 3 fully-connected layers]
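For reference, this implementation can be loaded directly from torchvision; a minimal usage sketch (the weights argument assumes torchvision >= 0.13; older versions use pretrained=True instead):

```python
import torch
from torchvision.models import alexnet

model = alexnet(weights='IMAGENET1K_V1')   # downloads the pretrained weights
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))   # one 224 x 224 RGB image
print(logits.shape)                               # torch.Size([1, 1000])
```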
Visualizing and understanding a CNN: for each layer and convolution filter, find the top-9 highest outputs and reconstruct the original input using a 'deconvnet', an inverse transformation that maps the outputs back to the input space. It is impossible to reconstruct the original image completely, but the pixels contributing to the high outputs are highlighted. The visualization also shows the original image patch. Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Proc. of ECCV, pp. 818-833.
Observations: there is a strong grouping within each filter (feature map); lower layers tend to focus on primitive shapes and patterns, whereas higher layers seem to recognize the objects to be classified; and discriminative parts of the image are exaggerated, e.g., the eyes and noses of dogs (layer 4, row 1, col 1, next page) and the grass in the background rather than the foreground objects (layer 5, row 1, col 2).
The idea of extracting local features in a hierarchical network was proposed in 1982 by Kunihiko Fukushima as the Neocognitron (Fukushima and Miyake, 1982). Kunihiko Fukushima and Sei Miyake. 1982. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15(6):455-459.
LeNet: Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278-2324. The first architecture that is very close to recent CNNs, proposed for handwritten character recognition; the model is trained by backpropagation. Some differences from recent CNNs: a sigmoid activation function (instead of ReLU) and subsampling pooling (instead of max pooling).
VGGNet: Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proc. of ICLR. A simple and popular CNN architecture that explores deeper CNNs. It mostly uses filters with a small receptive field, 3 x 3 (the smallest size that can capture the notion of left/right, up/down, and center). Max pooling is performed over a 2 x 2 pixel window (i.e., down-sampling to half the resolution), and the number of channels is increased by a factor of 2 after each pooling layer (see the sketch below). VGGNet ranked second in ILSVRC 2014.
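A sketch of this repeating pattern (the block helper and the stage configuration are illustrative choices, not the full VGG-16 definition):

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    # n_convs 3 x 3 convolutions (padding 1 keeps the spatial size), each
    # followed by ReLU; then 2 x 2 max pooling halves the resolution.
    layers = []
    for _ in range(n_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU()]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# The number of channels doubles after each pooling stage.
features = nn.Sequential(
    vgg_block(3, 64, 2),
    vgg_block(64, 128, 2),
    vgg_block(128, 256, 3),
)
```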
ResNet: an even deeper architecture (152 layers). Deeper networks are difficult to train because of the vanishing-gradient problem, so ResNet proposed a residual learning framework to ease the training of deep networks. Residual connection: suppose that we want to learn a function h(x). We consider another mapping f(x) = h(x) - x; then the original mapping is h(x) = f(x) + x. We can view f(x) + x as a feedforward neural network with shortcut connections, and training f(x) is easier than training h(x). ResNet also uses batch normalization. It was the winner of ILSVRC 2015. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proc. of CVPR.
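A simplified sketch of a residual block with batch normalization (real ResNet blocks also handle strides and channel changes in the shortcut connection):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    # The stacked layers learn f(x); the shortcut adds x back, giving f(x) + x.
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),   # batch normalization
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)   # shortcut connection: f(x) + x
```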
Summary: a CNN is a stack of convolution, non-linear, and pooling layers. A convolution layer applies a filter to the input (a resemblance to image filters); a non-linear transformation (e.g., ReLU) follows; and a pooling layer down-samples the outputs (e.g., max pooling). After a stack of convolutions, fully-connected layers make predictions. Parameters (e.g., filter weights) are trained by backpropagation. A lot of innovative ideas improved the performance of image classification, in addition to advances in computation power and big data.