Neural Networks (CNN). • Learn how to apply CNNs to visual detection and recognition tasks. • Learn how to apply transfer learning with image and language data. • Understand how to implement Convolutional Neural Networks using the PyTorch framework.
a universal function approximator which can be used for classification or regression problems. • They build up complex patterns from simple patterns hierarchically. • Each layer learns to detect simple combinations of the patterns detected by the previous layer. • The lowest layers of the model capture simple patterns, while the subsequent layers capture more complex patterns.
Given the following two images. Figure 1: Zebra ((a) Image 1, (b) Image 2). Task: classify the image as a zebra regardless of the orientation of the zebra in the image.
is very challenging. 1 Requires a very large network. 2 MLPs are sensitive to the location of the pattern. • Moving it by one component results in an entirely different input that the MLP won't recognize. In many problems the location of a pattern is not important • only the presence of the pattern matters. • Requirement: the network must be shift invariant.
designed specifically for such problems: • Handle very high input dimensions. • Exploit the 2D topology of images or the 3D topology of video data. • Build in invariance to certain variations we expect (translations, illumination, etc.).
networks for processing visual data. • They employ a mathematical operation called convolution in place of general matrix multiplication in at least one of their layers. • CNNs are often used for 2D or 3D data (such as grayscale or RGB images), but can also be applied to several other types of input, such as: 1 1D data: time series, raw waveforms 2 2D data: grayscale images, spectrograms 3 3D data: RGB images, multichannel spectrograms
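As a quick illustration (not from the slides), here is how PyTorch's convolution modules line up with these input types; the tensor shapes and channel counts are arbitrary examples:

import torch
import torch.nn as nn

# 1D data, e.g. a raw waveform: 1 channel, 16000 samples
x1 = torch.randn(1, 1, 16000)
y1 = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=3)(x1)

# 2D data, e.g. a 28 x 28 grayscale image (1 channel)
x2 = torch.randn(1, 1, 28, 28)
y2 = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)(x2)

# 3D data in the slide's sense, e.g. a 32 x 32 RGB image: in PyTorch the
# colour channels become the in_channels of a 2D convolution
x3 = torch.randn(1, 3, 32, 32)
y3 = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)(x3)

print(y1.shape, y2.shape, y3.shape)
# torch.Size([1, 8, 15998]) torch.Size([1, 8, 26, 26]) torch.Size([1, 8, 30, 30])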
CNN layer only depends on a subset of the input of that layer. • Each hidden unit is connected only to a subregion of the input image. • This reduces the number of parameters. • It reduces the cost of computing the linear activations of the hidden units. Figure 2: Local connectivity. Credit: Prof. Seungchul Lee
small filters (feature maps) and apply them to the entire layer input. • Units organized into the same feature map share parameters. • Hidden units within a feature map cover different positions in the image. • This allows features to be detected regardless of their position. Figure 3: Parameter sharing. Credit: Hugo Larochelle
an eye can detect an eye anywhere in an image (translation invariance). • Units organized into the same feature map share parameters. • Hidden units within a feature map cover different positions in the image. • This allows features to be detected regardless of their position. Figure 4: Credit: Hugo Larochelle
CNN and consist of a set of independent filters that can be thought of as feature extractors. • The result is obtained by taking the dot product between the filter w and a small 3 × 3 × 1 chunk of the image x, plus a bias term b, as the filter slides along the image: wᵀx + b • The step size of the slide is called the stride ⇒ it controls how the filter convolves around the input volume.
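A minimal NumPy sketch of this sliding dot product, assuming a square single-channel image and filter (the function name and shapes are illustrative, not from the slides):

import numpy as np

def conv2d_naive(x, w, b, stride=1):
    """Slide filter w over image x; each output value is the dot product w.T x + b."""
    N, F = x.shape[0], w.shape[0]              # square image, square filter assumed
    out = (N - F) // stride + 1
    y = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = x[i*stride:i*stride+F, j*stride:j*stride+F]
            y[i, j] = np.sum(w * patch) + b    # dot product with the local chunk
    return y

x = np.random.randn(7, 7)   # a 7 x 7 x 1 image (single channel)
w = np.random.randn(3, 3)   # a 3 x 3 x 1 filter
print(conv2d_naive(x, w, b=0.0).shape)  # (5, 5)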
A 7 × 7 × 1 image convolved with a 3 × 3 × 1 filter and a stride of 1. • If the size of the image is N × N, the size of the filter is F × F, and the stride is S: • The size of the feature map (output size) is (N − F)/S + 1 • For the image above: N = 7, F = 3, S = 1 ⇒ (7 − 3)/1 + 1 = 5, i.e. a 5 × 5 feature map.
= 3, stride S = 3 ⇒ (7 − 3)/3 + 1 = 2.33: does not fit. • To address this we pad the input with suitable values (padding with zeros is common) ⇒ to preserve the spatial size. • In general it is common to see convolutional layers with stride 1, an F × F filter, and zero padding with P = (F − 1)/2: F = 3 ⇒ zero pad with P = 1; F = 5 ⇒ zero pad with P = 2; F = 7 ⇒ zero pad with P = 3. A helper that checks these cases is sketched below.
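For illustration, a small hypothetical helper that evaluates (N − F + 2P)/S + 1 and flags the cases that do not fit:

def conv_output_size(N, F, S, P=0):
    """Output size (N - F + 2P)/S + 1; raises if the filter does not fit evenly."""
    size = (N - F + 2 * P) / S + 1
    if not size.is_integer():
        raise ValueError(f"does not fit: ({N} - {F} + 2*{P})/{S} + 1 = {size}")
    return int(size)

print(conv_output_size(7, 3, 1))        # 5
print(conv_output_size(7, 3, 2))        # 3
# conv_output_size(7, 3, 3)             # raises: 2.33 does not fit
print(conv_output_size(7, 3, 3, P=1))   # 3: padding makes stride 3 fit
print(conv_output_size(7, 3, 1, P=1))   # 7: P = (F - 1)/2 preserves the size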
a volume of size W1 × H1 × D1 • Requires four hyper-parameters: 1 Number of filters K. 2 Spatial extent of the filter F. 3 Stride S. 4 Amount of zero padding P. Common settings: • K = a power of 2, e.g. 4, 8, 16, 32, 64, 128 • F = 3, S = 1, P = 1 • F = 5, S = 1, P = 2 • F = 5, S = 2, P = whatever fits • Produces a volume of size W2 × H2 × D2 where W2 = (W1 − F + 2P)/S + 1, H2 = (H1 − F + 2P)/S + 1, D2 = K • The number of weights per filter is F · F · D1, so the total number of parameters is (F · F · D1) · K weights and K biases.
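The same formulas in code, sketched for illustration (the function and the 32 × 32 × 3 example input are assumptions, not from the slides):

def conv_layer_shapes(W1, H1, D1, K, F, S, P):
    """Output volume and parameter count of a conv layer, per the formulas above."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K
    weights = F * F * D1 * K     # F*F*D1 weights per filter, K filters
    biases = K
    return (W2, H2, D2), weights + biases

# e.g. a 32 x 32 x 3 input with K = 16 filters, F = 3, S = 1, P = 1
print(conv_layer_shapes(32, 32, 3, K=16, F=3, S=1, P=1))
# ((32, 32, 16), 448)  -> 3*3*3*16 = 432 weights + 16 biases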
in_channels (int) – Number of channels in the input image • out_channels (int) – Number of channels produced by the convolution • kernel_size (int or tuple) – Size of the convolving kernel • stride (int or tuple, optional) – Stride of the convolution. Default: 1 • padding (int or tuple, optional) – Zero-padding added to both sides of the input. Default: 0
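A short usage example of torch.nn.Conv2d with these parameters; the channel counts and input size are arbitrary:

import torch
import torch.nn as nn

# One conv layer matching the "common settings" above: F = 3, S = 1, P = 1
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
y = conv(x)
print(y.shape)                                     # torch.Size([1, 16, 32, 32])
print(sum(p.numel() for p in conv.parameters()))   # 448 = 3*3*3*16 + 16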
of a conv layer is run through a non-linear function. • The ReLU function is often used after every convolution operation. • It replaces all the negative values in the feature map with zero.
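A tiny illustration with torch.nn.functional.relu on a made-up feature map:

import torch
import torch.nn.functional as F

fmap = torch.tensor([[-1.5, 0.3], [2.0, -0.7]])  # a toy feature map
print(F.relu(fmap))
# tensor([[0.0000, 0.3000],
#         [2.0000, 0.0000]])  -> negatives replaced by zero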
filter ⇒ takes each feature map from a convolution layer and produces a condensed feature map. • Makes the representation smaller and more manageable. • Operates over each activation map independently. • Reduces the computational cost and the number of parameters. • Preserves spatial invariance.
Accepts a volume of size W1 × H1 × D1 • Requires two hyper-parameters: 1 Spatial extent of the filter F. 2 Stride S. Common settings: • F = 2, S = 2 • F = 3, S = 2 • Produces a volume of size W2 × H2 × D2 where W2 = (W1 − F)/S + 1, H2 = (H1 − F)/S + 1, D2 = D1 • Introduces zero parameters since it computes a fixed function of the input. • It is not common to use zero-padding for pooling layers.
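For illustration, the common F = 2, S = 2 setting in PyTorch (the input shape is chosen arbitrarily):

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # the common F = 2, S = 2 setting

x = torch.randn(1, 16, 32, 32)
print(pool(x).shape)  # torch.Size([1, 16, 16, 16]): (32 - 2)/2 + 1 = 16, depth unchanged
print(sum(p.numel() for p in pool.parameters()))  # 0: pooling has no parameters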
common to add one or more fully connected (FC) layers. • Contains neurons that connect to the entire input volume, as in an MLP. Figure 7: Credit: Arden Dertat
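A minimal sketch of flattening a conv/pool volume into an FC layer; the 16 × 16 × 16 volume and the 10 output classes are assumptions for illustration:

import torch
import torch.nn as nn

# Flatten the last conv/pool volume and feed it to a fully connected layer
x = torch.randn(1, 16, 16, 16)        # e.g. output of the pooling layer above
flat = x.view(x.size(0), -1)          # shape (1, 16*16*16) = (1, 4096)
fc = nn.Linear(16 * 16 * 16, 10)      # 10 output classes
print(fc(flat).shape)                 # torch.Size([1, 10])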
a specific class. • The whole image represents one class. • We don't want to know exactly where the object is → only one object is present. The standard performance measures are: • The error rate P(f(x; θ) ≠ y) or the accuracy P(f(x; θ) = y) • The balanced error rate (BER): (1/K) Σ_{i=1}^{K} P(f(x; θ) ≠ y_i | y = y_i)
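A small sketch of both measures on made-up predictions (the helper accuracy_and_ber is hypothetical, not a library function):

import torch

def accuracy_and_ber(pred, y, K):
    """Accuracy P(f(x) = y), and the error rate averaged over the K classes (BER)."""
    acc = (pred == y).float().mean()
    per_class_err = torch.stack([(pred[y == i] != i).float().mean() for i in range(K)])
    return acc, per_class_err.mean()

y = torch.tensor([0, 0, 1, 1, 1, 2])
pred = torch.tensor([0, 1, 1, 1, 1, 0])
acc, ber = accuracy_and_ber(pred, y, K=3)
print(acc, ber)  # acc = 4/6; BER = (1/2 + 0 + 1)/3 = 0.5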
at predicting the classes and locations of targets in an image. • Learn to detect a class and a rectangle of where that object is. A standard performance assessment considers a predicted bounding box B̂ correct if there is an annotated bounding box B for that class such that the Intersection over Union (IoU) is large enough: area(B ∩ B̂) / area(B ∪ B̂) ≥ 1/2
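A minimal IoU computation, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (the representation is an assumption, not from the slides):

def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

pred, annot = (0, 0, 10, 10), (5, 0, 15, 10)
print(iou(pred, annot), iou(pred, annot) >= 0.5)  # 0.333..., False: not correct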
detect all the objects in the image that belong to specific classes and give their locations. • An image may contain more than one object of different classes.
pixels with the class of the object it belongs to ⇒ it may also involve predicting the instance it belongs to. Two types: 1 Semantic segmentation: label each pixel in the image with a category label. 2 Instance segmentation: label each pixel in the image with a category label and distinguish individual instances.
in previous tasks to novel tasks. • Based on human learning: people can often transfer knowledge learnt previously to novel situations. Figure 9: Credit: Ramon Morros
network from scratch for your task: • Take a network trained on a different domain for a different source task. • Adapt it to your domain and your target task. • A popular approach in computer vision and natural language processing tasks.
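A minimal PyTorch sketch of this recipe using torchvision's ResNet18 pretrained on ImageNet; the 10-class target task is a placeholder assumption:

import torch.nn as nn
from torchvision import models

# Take a network trained on a different source task (ImageNet classification)...
model = models.resnet18(pretrained=True)

# ...freeze its feature extractor...
for param in model.parameters():
    param.requires_grad = False

# ...and adapt it to the target task by replacing the final layer.
num_classes = 10  # hypothetical target task
model.fc = nn.Linear(model.fc.in_features, num_classes)
# Only model.fc.parameters() are now trained on the target data.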
an entire CNN from scratch (with random initialization) ⇒ (computation time and data availability) • Very deep networks are expensive to train. For example, training ResNet18 for 30 epochs on 4 NVIDIA K80 GPUs took us 3 days. • Determining the topology/flavour/training method/hyperparameters for deep learning is a black art with not much theory to guide you.
TelecomBCN, Barcelona (winter 2017) • 6.S191 Introduction to Deep Learning: MIT 2018 • Deep Learning Specialization by Andrew Ng: Coursera • Introduction to Deep Learning: CMU 2018 • CS231n: Convolutional Neural Networks for Visual Recognition: Stanford 2018 • Deep Learning in PyTorch, François Fleuret: EPFL 2018