Neural Networks. This tutorial consists of 3 parts, with Q&A and breaks inserted between the parts. Please feel free to interrupt me at any time to ask questions!!
Prior information plays a role in selecting a solution from candidates (Signal A, Signal B, Signal C, Signal D, ...) - Example 1: a smoothness prior may select Signal A - Example 2: a periodicity prior may select Signal D. PRIOR IS ALL YOU NEED!
An image can be treated as a vector in many cases: consider arranging all pixel values into a single vector. For example, a gray-scale image of size 100 × 100 ➔ a 10000-dimensional vector.
Priors make the solution space smaller or unique. In practice, we consider an optimization problem of the form: minimize f(x) + g(x) over x, where f is the penalty function for the noise prior (data fidelity) and g is the penalty function for the image prior.
This framework covers a wide class of image reconstruction methods: many methods can be produced from the above framework by combining a noise prior (Gaussian noise, Laplacian noise, Poisson noise) with an image prior. When both penalty functions are convex, the problem can be solved by convex optimizers such as the alternating direction method of multipliers (ADMM) and the primal-dual hybrid gradient (PDHG) method.
To discuss image priors, note that they can be characterized as combinations of two elements: Representation (vector, matrix, tensor, graph, function; in Fourier space, in wavelet coefficient space, in other linear transforms) × Properties (sparsity, group sparsity, low-rank, orthogonality, smoothness, non-negativity, self-similarity, etc.).
Sparsity is measured by the number of non-zero entries, which is called the “L0-norm”. Its convex relaxation is the “L1-norm”. Both the L0-norm and the L1-norm penalties prefer sparsity. For example, (1, 0, 0, 0, 3, 0, 0, 2) is sparse, while (1, 2, 5, 4, 3, 2, 1, 2) is dense.
The following two sparsity-inducing operators can be obtained: L0-norm minimization leads to hard thresholding, and L1-norm minimization leads to soft thresholding.
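As a concrete illustration, here is a minimal sketch of the two operators in Python (NumPy); the threshold value `lam` and the example vector are illustrative choices, not values from the slides.

```python
import numpy as np

def hard_threshold(x, lam):
    """L0-type operator: keep entries whose magnitude exceeds lam, zero out the rest."""
    return np.where(np.abs(x) > lam, x, 0.0)

def soft_threshold(x, lam):
    """L1 operator (convex relaxation): shrink every entry toward zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([1.0, -0.2, 0.0, 3.0, -0.7, 2.0])
print(hard_threshold(x, 0.5))   # approximately [1, 0, 0, 3, -0.7, 2]
print(soft_threshold(x, 0.5))   # approximately [0.5, 0, 0, 2.5, -0.2, 1.5]
```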
Sparse regularization under a linear transform can be handled in the same way. For example, it can be solved by ADMM (alternating direction method of multipliers): rewrite it as an equivalent constrained problem, form the augmented Lagrangian, and apply alternating updates.
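Below is a hedged sketch (not the speaker's code) of how ADMM can solve such a problem. The model min over x of 0.5*||y - x||^2 + lam*||D x||_1, the first-order difference operator D, and all parameter values are my own illustrative choices.

```python
import numpy as np

def admm_l1_analysis(y, D, lam=0.5, rho=1.0, n_iter=200):
    """ADMM for  min_x 0.5*||y - x||^2 + lam*||D x||_1  (sparsity under a linear transform D).
    Split as z = D x, then alternate x-update, z-update (soft thresholding), dual update."""
    n = len(y)
    x = y.copy()
    z = D @ x
    u = np.zeros_like(z)
    A = np.eye(n) + rho * D.T @ D          # fixed matrix of the x-update linear system
    for _ in range(n_iter):
        x = np.linalg.solve(A, y + rho * D.T @ (z - u))                      # x-update
        v = D @ x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)              # soft thresholding
        u = u + D @ x - z                                                    # dual update
    return x

# example: denoise a piecewise-constant signal with a difference operator D
n = 100
t = np.linspace(0, 1, n)
clean = (t > 0.5).astype(float)
y = clean + 0.1 * np.random.randn(n)
D = np.diff(np.eye(n), axis=0)             # (n-1) x n first-order difference matrix
x_hat = admm_l1_analysis(y, D, lam=0.5)
```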
A smoothness penalty must prefer smooth signals and penalize non-smooth signals. It is based on the differences between neighboring entries: for a one-dimensional signal, QV (quadratic variation) is the sum of squared differences, QV(x) = Σ_i (x_{i+1} − x_i)², and TV (total variation) is the sum of absolute differences, TV(x) = Σ_i |x_{i+1} − x_i|. For example, (1, 1, 1, 1, 1, 1, 1, 1) is smooth, while (4, 3, 5, 2, 0, 5, 4, 1) is not smooth.
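A small NumPy check of the two penalties on the example signals above:

```python
import numpy as np

def qv(x):
    """Quadratic variation: sum of squared differences between neighbors."""
    return np.sum(np.diff(x) ** 2)

def tv(x):
    """Total variation: sum of absolute differences between neighbors."""
    return np.sum(np.abs(np.diff(x)))

smooth = np.array([1, 1, 1, 1, 1, 1, 1, 1], dtype=float)
rough  = np.array([4, 3, 5, 2, 0, 5, 4, 1], dtype=float)
print(qv(smooth), tv(smooth))   # 0.0 0.0   -> small penalty for the smooth signal
print(qv(rough),  tv(rough))    # 53.0 17.0 -> large penalty for the non-smooth signal
```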
It can also be seen as block sparseness of images under a linear transform. Normal sparseness means “there are few non-zero entries in a vector”, while block sparseness means “there are few non-zero vectors in a matrix”.
Non-local similarity of images: “there are many similar patches (blocks) in a single image”. Many images are composed of planes, lines, and textures, and this applies to images of both natural and artificial objects.
NLM and BM3D are methods for image denoising using the non-local similarity prior of images. Roughly speaking, NLM/BM3D applies the following processes for all patches: (1) select a reference patch from the image, (2) extract similar patches from the image by template matching, (3) denoise the reference patch by a weighted average of the similar patches, (4) return the denoised patch to its original position in the image. ※ This is not a strict explanation, but it is useful for easy and intuitive understanding.
Mathematical formulation of the non-local filter (as a linear transform): the weight matrix W is important, where W_ij represents the weight of the connection between the i-th and j-th pixels. For a weighted mean, Σ_j W_ij = 1 must be satisfied. The calculation method of W_ij determines the nature of the filter; for example, it is calculated from the i-th and j-th patches as W_ij = exp(−||p_i − p_j||² / h²) / Z_i, where p_i and p_j are the i-th and j-th patches and Z_i is a normalization constant.
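A minimal sketch of this formulation for a 1-D signal, assuming a Gaussian weight on patch distances; the patch size and bandwidth `h` are illustrative parameters.

```python
import numpy as np

def nonlocal_filter_1d(y, patch=5, h=0.1):
    """Non-local filter applied as a linear transform: x_hat = W y.
    W[i, j] = exp(-||p_i - p_j||^2 / h^2) / Z_i, rows normalized so sum_j W[i, j] = 1."""
    n = len(y)
    r = patch // 2
    ypad = np.pad(y, r, mode="edge")
    P = np.stack([ypad[i:i + patch] for i in range(n)])       # P[i] = patch around pixel i
    D = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)   # squared patch distances
    W = np.exp(-D / h ** 2)
    W /= W.sum(axis=1, keepdims=True)                         # normalization constant Z_i
    return W @ y                                              # weighted mean over all pixels

y = np.sin(np.linspace(0, 4 * np.pi, 200)) + 0.1 * np.random.randn(200)
x_hat = nonlocal_filter_1d(y, patch=7, h=0.5)
```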
Local filter vs non-local filter: two filters are prepared and compared on a noisy signal. [Figure: original, noisy, local filter (moving average), and non-local filter constructed from the noisy signal. With the local filter, high-frequency components disappear; with the non-local filter, noise is reduced successfully.]
The non-local filter can be interpreted through a graph structure of pixels: node = pixel, edge = similarity between pixels. In a sense, “non-local” is a re-definition of locality based on the graph (local connections vs non-local connections; isotropic diffusion vs anisotropic diffusion).
Summary of Part I: three image priors are introduced for the observation model y = Ax + n (y and A are known; x and n are unknown; x is to be estimated): Sparseness (hard/soft thresholding, sparse regularization under a linear transform), Smoothness (quadratic variation, total variation), and Non-local similarity (non-local filter).
Recall that image priors can be characterized as combinations of two elements, Representation × Properties. From now on, we focus on the matrix and tensor representations and their low-rank approximation. Representation: vector, matrix, tensor, graph, function; in Fourier space, in wavelet coefficient space, in other linear transforms. Properties: sparsity, group sparsity, low-rank, orthogonality, independence, non-negativity, self-similarity, etc.
In this part, we see that there are various low-rank models for tensors (tensor representation + low-rank approximation): matrix factorization, nuclear norm regularization, CP decomposition, Tucker decomposition, TT decomposition, TR decomposition, etc.
Recursive representations of an Nth-order tensor. A 4th-order tensor is a block 3rd-order tensor whose elements are vectors, a block matrix whose elements are matrices, or a block vector whose elements are 3rd-order tensors. A 5th-order tensor is a block 4th-order tensor whose elements are vectors, a block 3rd-order tensor whose elements are matrices, a block matrix whose elements are 3rd-order tensors, or a block vector whose elements are 4th-order tensors. In general, an Nth-order tensor is a block vector whose elements are (N−1)th-order tensors.
Fields of application in computer science / data engineering (1990s-): multimedia signal processing (audio, speech, image, video), biomedical signal processing (EEG, MEG, functional MRI, diffusion MRI), bioinformatics (gene, protein), remote sensing (hyperspectral images), web/data mining (SNS analysis, recommender systems), non-linear time series analysis, wireless communications, deep learning, etc.
Matrix rank: a rank-1 matrix is the outer product of any two non-zero vectors (all its columns have the same direction, and all its rows have the same direction). The rank of a matrix is the minimum number of rank-1 matrices required to represent it as their sum.
The singular value decomposition (SVD) is a fundamental tool for matrix analysis: X = U Σ Vᵀ, where U contains the left singular vectors, V contains the right singular vectors, and Σ is the diagonal matrix of singular values. The matrix rank can be obtained as the number of non-zero singular values.
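A quick NumPy check that the number of non-zero singular values equals the matrix rank; the example matrix is an arbitrary rank-2 construction of my own.

```python
import numpy as np

# a rank-2 matrix built as the sum of two rank-1 (outer-product) matrices
A = np.outer([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]) + np.outer([1.0, 0.0, 1.0], [2.0, -1.0, 0.0])

s = np.linalg.svd(A, compute_uv=False)        # singular values (diagonal of Sigma)
rank = int(np.sum(s > 1e-10 * s[0]))          # count the non-zero singular values
print(rank, np.linalg.matrix_rank(A))         # both report 2
```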
Note what is implied by the low-rank prior: spatial continuity is ignored, because the matrix rank is invariant to switching (permuting) columns or rows; the rank remains R after any such switch.
The solution of the nuclear-norm-regularized problem is known as singular value thresholding, which can be regarded as soft thresholding of the singular values. Using the SVD of Y = U Σ Vᵀ, the solution is given by X = U soft_λ(Σ) Vᵀ, where soft thresholding is applied to the diagonal of Σ.
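A minimal NumPy sketch of singular value thresholding; the threshold `lam` is a generic parameter.

```python
import numpy as np

def singular_value_thresholding(Y, lam):
    """Soft thresholding of the singular values: X = U * soft_lam(Sigma) * V^T."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

Y = np.random.randn(20, 15)
X = singular_value_thresholding(Y, lam=1.0)
print(np.linalg.matrix_rank(X) <= np.linalg.matrix_rank(Y))   # thresholding never increases the rank
```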
Tensor decompositions: a general definition may be “a tensor represented by multiple tensors”. A specific decomposition is given by a parametric model f and its parameter tensors (a core tensor and factor matrices), X = f(G, A, B, ...), analogous to a matrix decomposition such as X = ABᵀ.
Low-rank based tensor data recovery: Low-n-rank tensor recovery [Gandy+, 2011 (in Inverse Problems)], Low rank tensor completion [Liu+, 2012 (in IEEE-TPAMI)], Bayesian CP decomposition [Zhao+, 2015 (in IEEE-TPAMI)], t-SVD [Zhang+, 2017 (in IEEE-TSP)], Low-rank and smooth tensor recovery [Yokota+, 2017 (in CVPR)] [Yokota+, 2019 (in IEEE-Access)], Tensor ring completion [Wang+, 2017 (in ICCV)]. Note that most of them focus on image inpainting (tensor completion); therefore, we introduce image inpainting methods in this part.
There are two approaches. (1) Penalty-function approach (nuclear norms), e.g., minimize ||P_Ω(X − T)||² + λ·g(X) over X; when g is convex, it can be solved by convex optimizers such as ADMM and PDS. (2) Obtaining explicit low-rank tensor decompositions, e.g., minimize ||P_Ω(T − f(G, U₁, ..., U_N))||² + penalties over the set {G, U₁, ..., U_N} of the core tensor and factor matrices. Here T is the incomplete tensor, P_Ω is the projection operator onto the set Ω of observed entries, and the penalties for the core tensor and factor matrices encode properties such as orthogonality, sparseness, and smoothness.
Tensor nuclear norms: the tensor nuclear norm based on CP decomposition (CPD) is not useful in practice, since obtaining the CPD is NP-hard. The tensor nuclear norm based on Tucker decomposition (TKD), defined via the matrix nuclear norms of the n-th mode unfoldings, is a convex function and quite useful.
LRTC: low rank tensor completion [Liu+, 2012 (in IEEE-TPAMI)] minimizes the sum of the nuclear norms of the mode unfoldings (1st, 2nd, 3rd mode) of the target tensor subject to the observed entries (enforced by an indicator function). It can be solved by ADMM: rewrite it as an equivalent problem, form the augmented Lagrangian, and alternately update the variables (projection for X, singular value thresholding for Y, and an update for the dual variable Z).
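The following is a hedged sketch of an LRTC-style ADMM for a 3rd-order tensor (in the spirit of HaLRTC); the weights `alpha`, penalty `rho`, and iteration count are illustrative assumptions, not the settings of [Liu+, 2012].

```python
import numpy as np

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def fold(M, mode, shape):
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape([shape[mode]] + rest), 0, mode)

def svt(M, tau):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def lrtc_admm(T, Q, alpha=(1/3, 1/3, 1/3), rho=1.0, n_iter=200):
    """T: tensor with observed entries, Q: boolean mask (True = observed)."""
    X = np.where(Q, T, 0.0)
    N = T.ndim
    Y = [np.zeros_like(T) for _ in range(N)]           # dual variables
    for _ in range(n_iter):
        # singular value thresholding of each mode unfolding
        M = [fold(svt(unfold(X + Y[n] / rho, n), alpha[n] / rho), n, T.shape)
             for n in range(N)]
        # consensus step and projection onto the observed entries
        X = sum(M[n] - Y[n] / rho for n in range(N)) / N
        X = np.where(Q, T, X)
        for n in range(N):                             # dual update
            Y[n] = Y[n] - rho * (M[n] - X)
    return X

U = [np.random.randn(20, 3) for _ in range(3)]
T = np.einsum("ir,jr,kr->ijk", *U)                     # a low-rank ground-truth tensor
Q = np.random.rand(*T.shape) > 0.4                     # ~60% observed
X_hat = lrtc_admm(T, Q)
print(np.linalg.norm((X_hat - T)[~Q]) / np.linalg.norm(T[~Q]))  # relative error on missing entries
```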
t-SVD is a convolutional tensor decomposition model based on the t-product. It can be obtained by sequential steps: DFT along the 3rd mode, SVD of each frontal slice, and inverse DFT.
[Zhang+, 2017 (in IEEE-TSP)] proposed a tensor nuclear norm based on the t-SVD: the tubal nuclear norm, which is the same as the matrix nuclear norm of the block circulant matrix (blcirc) of the tensor; the block circulant matrix is block-diagonalized by the DFT.
The corresponding singular value thresholding can be obtained by the following steps: DFT along the 3rd mode → SVD of each frontal slice → soft thresholding of the singular values → inverse DFT.
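A hedged NumPy sketch of these steps; the handling of the threshold across frequencies is simplified relative to the published algorithm.

```python
import numpy as np

def tubal_svt(Y, tau):
    """DFT along the 3rd mode -> SVT of each frontal slice -> inverse DFT."""
    Yf = np.fft.fft(Y, axis=2)                     # DFT along the 3rd mode
    Xf = np.empty_like(Yf)
    for k in range(Y.shape[2]):                    # SVD + soft thresholding per frontal slice
        U, s, Vh = np.linalg.svd(Yf[:, :, k], full_matrices=False)
        Xf[:, :, k] = (U * np.maximum(s - tau, 0.0)) @ Vh
    return np.real(np.fft.ifft(Xf, axis=2))        # inverse DFT

Y = np.random.randn(30, 30, 10)
X = tubal_svt(Y, tau=5.0)
```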
Alternating least squares (ALS) for TD: the objective function is non-convex with respect to all factors jointly, but it is convex with respect to each factor individually. ALS can be applied to a wide range of TD models (CPD, TKD, TT, and TR); a code sketch appears after the rank-increment slide below.
Rank selection is a critical problem in TD. The rank increment strategy works well!! Start from R = 1, optimize, set R ← R + 1, optimize again, and repeat until convergence.
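As an illustration of both ideas, here is a hedged NumPy sketch of CP decomposition by ALS combined with the rank-increment strategy; the unfolding convention, initialization, warm start, and tolerance are my own choices.

```python
import numpy as np

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def khatri_rao(mats):
    """Column-wise Kronecker product, ordered to match the C-order unfolding above."""
    out = mats[0]
    for M in mats[1:]:
        out = np.einsum("ir,jr->ijr", out, M).reshape(-1, out.shape[1])
    return out

def cp_reconstruct(factors):
    shape = [A.shape[0] for A in factors]
    return (factors[0] @ khatri_rao(factors[1:]).T).reshape(shape)

def cp_als(T, factors, n_iter=100):
    """ALS: each factor update is a linear least-squares problem (convex in that factor)."""
    N = T.ndim
    for _ in range(n_iter):
        for n in range(N):
            others = [factors[m] for m in range(N) if m != n]
            KR = khatri_rao(others)
            V = np.multiply.reduce([A.T @ A for A in others])   # Hadamard product of Gram matrices
            factors[n] = unfold(T, n) @ KR @ np.linalg.pinv(V)
    return factors

# rank increment: start from R = 1, optimize, then R <- R + 1 until the fit is good enough
true_factors = [np.random.randn(10, 3), np.random.randn(11, 3), np.random.randn(12, 3)]
T = cp_reconstruct(true_factors)

factors = None
for R in range(1, 8):
    if factors is None:
        factors = [np.random.randn(I, R) for I in T.shape]
    else:  # warm start: append one small random column to each factor
        factors = [np.hstack([A, 0.1 * np.random.randn(A.shape[0], 1)]) for A in factors]
    factors = cp_als(T, factors)
    err = np.linalg.norm(T - cp_reconstruct(factors)) / np.linalg.norm(T)
    print(R, err)
    if err < 1e-6:
        break
```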
Multiple priors (low-rank, smoothness, sparseness, non-negativity, etc.) can be combined in both the penalty-function-based approach and the TD-based approach. A typical combination is low-rank and smoothness.
A limitation of low-rank tensor decomposition: when whole slices are missing in a tensor, low-rank tensor decomposition does not work. Randomly missing pixels can be recovered, but missing slices cannot, because a missing slice deflates the matrix/tensor rank and the deflated component cannot be recovered.
The strategy is to “tensorize” and “decompose”. A question here is how to tensorize the signals: generally, some specific knowledge of the signals is necessary to design the tensors. Here, I introduce direct folding, ket augmentation, and multi-way Hankelization.
An image is naturally represented by a matrix, but folding has some benefits for compression: compare a low-rank approximation of the (256, 256) matrix with #param = 2048 against folding into a 16th-order tensor of size (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2) and applying a low-rank approximation, also with #param = 2048.
The idea is to use low-rank tensor decomposition not in the direct tensor space (height × width × rgb), but in an embedded (Hankel) tensor space: apply delay embedding / Hankelization to obtain a higher-order incomplete tensor, perform low-rank tensor completion there, and apply the inverse operation to output the completed image.
What is the inverse of DE (delay embedding)? There is no exact inverse of DE, since it is essentially a duplication; we only consider the pseudo-inverse of DE, which folds back and averages the duplicated entries.
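A small NumPy sketch of the duplication and its pseudo-inverse (averaging the duplicated entries along the anti-diagonals):

```python
import numpy as np

def delay_embedding(x, tau):
    """Duplication: build a tau x (N - tau + 1) Hankel matrix from a 1-D signal."""
    N = len(x)
    return np.stack([x[i:i + N - tau + 1] for i in range(tau)])

def inverse_delay_embedding(H):
    """Pseudo-inverse of the duplication: average the duplicated entries."""
    tau, L = H.shape
    N = tau + L - 1
    total = np.zeros(N)
    count = np.zeros(N)
    for i in range(tau):
        total[i:i + L] += H[i]
        count[i:i + L] += 1
    return total / count

x = np.arange(8.0)
H = delay_embedding(x, 3)                            # 3 x 6 Hankel matrix, H[i, j] = x[i + j]
print(np.allclose(inverse_delay_embedding(H), x))    # True
```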
When a signal is Hankelized by DE, one obtains a time series of Hankel matrices; a further DE obtains a “block Hankel matrix”. A block Hankel matrix is defined as a block matrix which has Hankel structure and all of whose elements are also Hankel matrices.
MDT: a generalization of DE for tensors, consisting of multilinear duplication + folding; we call it the “multiway delay-embedding transform (MDT)”. Along each mode, a dimension I is duplicated to τ·(I − τ + 1) and folded into (τ, I − τ + 1); thus a 3rd-order tensor of size (I1, I2, I3) becomes a 6th-order tensor.
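A hedged sketch of the idea using NumPy's sliding_window_view: the result contains the same (duplicated) entries as the MDT block Hankel tensor, arranged as a 2N-th order tensor; the explicit duplication-matrix form is omitted.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

image = np.random.rand(64, 64)            # a 2nd-order tensor (gray-scale image)
tau = (8, 8)                              # delay (window) size for each mode

H = sliding_window_view(image, tau)       # 4th-order tensor of shape (57, 57, 8, 8)
print(H.shape)

# each H[i, j] is an 8x8 patch, and entries are duplicated across neighboring (i, j),
# exactly as in the 1-D delay embedding
print(np.allclose(H[3, 5], image[3:11, 5:13]))   # True
```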
Low-rank Tucker decomposition in embedded space [Yokota+, 2018 (in CVPR)]: ① MDT of the given incomplete tensor of original size (Nth order ➔ 2Nth order), ② Tucker decomposition of the incomplete block Hankel tensor, ③ inverse MDT of the completed block Hankel tensor (2Nth order ➔ Nth order).
Here, I explain the method for Tucker decomposition of an incomplete tensor. The optimization problem can be given by minimizing ||Q ⊛ (T − [[G; U₁, ..., U_N]])||², where T is the incomplete tensor and Q is a binary tensor of the same size (0 represents a missing entry, 1 represents an observed entry). It can be solved by an auxiliary function approach and the ALS algorithm: alternately update the auxiliary (filled-in) tensor and the Tucker factors until convergence; the inner update is the same as Tucker decomposition for a complete tensor.
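A hedged NumPy sketch of this auxiliary-function idea: fill the missing entries with the current reconstruction, then refit a Tucker model. Here a truncated HOSVD is used as the inner step instead of a full ALS sweep, which is my simplification.

```python
import numpy as np

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_product(X, U, mode):
    """Mode-n product: multiply U along the given mode of X."""
    return np.moveaxis(np.tensordot(U, np.moveaxis(X, mode, 0), axes=1), 0, mode)

def truncated_hosvd(X, ranks):
    factors = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :r]
               for n, r in enumerate(ranks)]
    core = X
    for n, U in enumerate(factors):
        core = mode_product(core, U.T, n)
    return core, factors

def tucker_reconstruct(core, factors):
    X = core
    for n, U in enumerate(factors):
        X = mode_product(X, U, n)
    return X

def tucker_completion(T, Q, ranks, n_iter=100):
    """T: incomplete tensor, Q: boolean mask (True = observed, False = missing)."""
    X = np.where(Q, T, T[Q].mean())                   # initialize missing entries
    for _ in range(n_iter):
        core, factors = truncated_hosvd(X, ranks)     # same as Tucker for a complete tensor
        X = np.where(Q, T, tucker_reconstruct(core, factors))  # keep observed, update missing
    return X

U_true = [np.random.randn(I, 3) for I in (15, 16, 17)]
G_true = np.random.randn(3, 3, 3)
T_full = tucker_reconstruct(G_true, U_true)
Q = np.random.rand(*T_full.shape) > 0.3               # ~70% observed
X_hat = tucker_completion(np.where(Q, T_full, 0.0), Q, ranks=(3, 3, 3))
```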
Summary of Part II: low-rank priors with matrix and tensor representations. Penalty-function-based approach: the matrix nuclear norm works well for the low-rank prior; there are several nuclear norms for tensors (TKD-based and t-SVD-based), and the resulting problems can be solved by ADMM or PDS. TD-based approach: there are many TD models; TD is solved by ALS, tensor completion can be solved by MM, and rank increment works well for adaptive rank selection. The low-rank prior under tensorization is a hot topic now.
Neural networks and machine learning: the output of a neural network is a signal obtained by repeatedly applying linear operations and nonlinear activations to an input signal. Various functions can be expressed by adjusting, as parameters, the weight matrices used for the linear operation of each layer. Universal approximation ability: a two-layer NN consisting of an infinite number of nodes can approximate any function. Deep structure: it is good for approximating complicated functions, and it helps supervised and unsupervised learning using large amounts of data.
Supervised learning uses pairs of input x and output y (training data). Learning process: Step 1, design a mathematical model; Step 2, fit the mathematical model to the training data. The trained model is obtained by fitting to the training samples.
ImageNet Large Scale Visual Recognition Challenge 2012: a huge database of images (ImageNet), training on about 1.2 million images and competing on a 1000-class classification task; deep learning (Univ. of Toronto) won.
In an encoder-decoder (auto-encoder) model, images are mapped to appropriate coordinates in the latent space. Due to the low dimensionality of the latent space, similar images are located close to each other, and continuous changes of images along the latent space can be seen.
A GAN performs adversarial training with a generator and a discriminator. Generator (decoder): optimized to generate fake images from latent codes (drawn from a normal distribution) so that they deceive the discriminator. Discriminator: optimized to accurately classify fake images and real images. It is like a game of AI vs AI.
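A minimal, hedged sketch of this adversarial game in TensorFlow/Keras; the architectures, image size (28×28, flattened), and learning rates are placeholders of my own, not settings from the slides.

```python
import tensorflow as tf

latent_dim = 64
G = tf.keras.Sequential([tf.keras.layers.Dense(128, activation="relu"),
                         tf.keras.layers.Dense(28 * 28, activation="sigmoid")])  # generator (decoder)
D = tf.keras.Sequential([tf.keras.layers.Dense(128, activation="relu"),
                         tf.keras.layers.Dense(1)])                              # discriminator
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def train_step(real_images):
    z = tf.random.normal((tf.shape(real_images)[0], latent_dim))   # latent codes ~ normal distribution
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = G(z)
        d_real, d_fake = D(real_images), D(fake_images)
        # discriminator: accurately classify real (1) vs fake (0) images
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
        # generator: deceive the discriminator (make fakes be classified as real)
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, D.trainable_variables), D.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, G.trainable_variables), G.trainable_variables))
    return g_loss, d_loss

train_step(tf.random.uniform((32, 28 * 28)))   # one step on a dummy "real" batch
```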
ML-based image restoration learns a parametric function that takes a corrupted image and outputs a clean image. Corrupted images are created in a pseudo manner from a clean image database, and the parametric function is optimized to fit the training data (corrupted image → clean image).
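A hedged sketch of this training setup in TensorFlow/Keras: the corruption (Gaussian noise), data, architecture, and training settings below are illustrative assumptions, not the methods discussed in the slides.

```python
import numpy as np
import tensorflow as tf

clean = np.random.rand(500, 32, 32, 1).astype("float32")                   # stand-in clean database
corrupted = clean + 0.1 * np.random.randn(*clean.shape).astype("float32")  # pseudo corruption

model = tf.keras.Sequential([
    tf.keras.Input((32, 32, 1)),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(1, 3, padding="same"),
])
model.compile(optimizer="adam", loss="mse")        # fit the parametric function to the pairs
model.fit(corrupted, clean, epochs=5, batch_size=32)

restored = model.predict(corrupted[:1])            # apply the trained model to a corrupted image
```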
Super-resolution by supervised learning [Ledig+, CVPR2017]: create many low-resolution and high-resolution image pairs and learn the input/output relationship with a CNN (low-resolution image → high-resolution image).
SRGAN is SRResNet with adversarial regularization: the generator maps an LR image to an SR image (fake), and the discriminator classifies {true, fake} against the HR image (real). The loss combines the content loss of SRResNet (VGG features) with an adversarial regularizer: G tries to deceive D, while D keeps its classification accuracy.
ML-based image restoration with a data-driven prior. Training phase: learn a generative model from a set of clean images. Restoration phase: perform image restoration based on the data-driven prior (e.g., ML-based super-resolution), which constrains the solution to the learned image manifold within some error bound.
Data-driven priors can introduce bias (skin color, hair color, hairstyle, glasses, gender, age, clothes, accessories, etc.). StyleGAN can generate high-quality HR images; however, the generated images are still in a specific domain (not universal). [Figure: original, downscaled, and upscaled by PULSE.] ※ Demo code is available at [github].
In [Ulyanov+, 2018], the noise impedance of CNNs was shown: four targets were tested for reconstruction (a natural image, a natural image with noise, a natural image with pixel shuffle, and uniform noise), and the convergence speeds were different; the natural image was fitted faster and the noise slower. DIP exploits this delay. However, this is just a report of a phenomenon. Question: why CNNs? Why convolution?
The CNN structure itself has some image prior; however, this prior is difficult to explain in words. Note that convolution = DE + linear transform, so the MDT × low-rank model has some relationship with CNNs.
MMES: a network structure using MDT [Yokota+, 2020 (in IEEE-TNNLS)]. It has a simple structure (MDT ➔ MLP ➔ inverse MDT): each column vector (image patch) is independently transformed by a multi-layer perceptron (MLP). The MLP is trained as an auto-encoder (AE), which can be interpreted as manifold learning; in practice, a denoising auto-encoder (DAE) is employed.
MMES can be interpreted as optimizing a single problem: the prior in MMES is the low-dimensionality of the patch manifold (each patch, extracted by MDT and vectorized, is constrained to lie on a low-dimensional manifold).
Replacing the manifold constraint with an MLP (AE), the problem can be transformed into a cost consisting of an error term (data fidelity) and a regularization term (the auto-encoding error of the patches).
Interpretation of the DAE [Vincent+, 2010]: patches are extracted from the corrupted image, perturbed with noise, auto-encoded, and the reconstructed patches are aggregated under a data-fidelity term. Compression works well when there are many similar patches ➔ low-dimensionality can be an image prior ➔ it can be regarded as non-local similarity. The denoising auto-encoder (DAE) learns a low-dimensional manifold from the patch distribution.
Implementation of MMES: it is implemented with a deep learning toolbox (TensorFlow etc.), the cost function is minimized using the Adam optimizer, and various tasks can be solved by the same framework. For inpainting, the reconstruction loss is masked to the observed pixels (MDT ➔ AE ➔ inverse MDT, with an AE loss and a masked reconstruction loss).
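For concreteness, here is a rough, hedged sketch of this kind of implementation in TensorFlow (it is not the authors' code): the image and a patch auto-encoder are optimized jointly with Adam, using a masked data-fidelity loss plus a DAE loss on patches. The patch size, layer widths, noise level, and loss weight are my own placeholders.

```python
import numpy as np
import tensorflow as tf

H, W, p = 64, 64, 8
y = np.random.rand(H, W).astype("float32")                 # observed (corrupted) image, stand-in
mask = (np.random.rand(H, W) > 0.5).astype("float32")      # 1 = observed pixel, 0 = missing

x = tf.Variable(tf.random.uniform((1, H, W, 1)))           # image to be recovered
ae = tf.keras.Sequential([                                  # MLP auto-encoder on patch vectors
    tf.keras.layers.Dense(16, activation="relu"),           # low-dimensional code
    tf.keras.layers.Dense(p * p)])
ae.build((None, p * p))
opt = tf.keras.optimizers.Adam(1e-3)

def patches(img):
    """MDT-like patch extraction: all p x p patches, flattened to vectors."""
    P = tf.image.extract_patches(img, sizes=[1, p, p, 1], strides=[1, 1, 1, 1],
                                 rates=[1, 1, 1, 1], padding="VALID")
    return tf.reshape(P, [-1, p * p])

for step in range(500):
    with tf.GradientTape() as tape:
        P = patches(x)
        P_noisy = P + 0.05 * tf.random.normal(tf.shape(P))            # DAE: reconstruct from noisy patches
        ae_loss = tf.reduce_mean((ae(P_noisy) - P) ** 2)              # patch-manifold (AE) loss
        fidelity = tf.reduce_mean((mask * (x[0, :, :, 0] - y)) ** 2)  # masked data fidelity (inpainting)
        loss = fidelity + 0.1 * ae_loss
    variables = [x] + ae.trainable_variables
    opt.apply_gradients(zip(tape.gradient(loss, variables), variables))
```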
Implementation of MMES: deblurring. The same framework is used, with the cost function minimized by the Adam optimizer and the blur operator included in the reconstruction loss.
Implementation of MMES: super-resolution. Again the same framework, with the downsampling operator included in the reconstruction loss.
Summary of Part III: ML-based image restoration has attracted much attention. One line of work learns a data-driven prior from a large amount of data (SRCNN, SRGAN, PULSE); another exploits the fact that the network structure (CNN) itself provides an image prior (deep image prior, patch manifold model). Comparison:
- Accuracy: Data-driven ◎ / DIP ◦ / Patch manifold ◦
- Computation: Data-driven fast / DIP slow / Patch manifold slow
- Training data: Data-driven necessary / DIP not necessary / Patch manifold not necessary
- Interpretability: Data-driven depends on data / DIP unclear / Patch manifold clear
Thanks to the organizers of APSIPA ASC 2021 and all participants!! In this tutorial, I tried to summarize prior-based image restoration methods from multiple perspectives: classical methods (sparseness, smoothness, and non-local similarity priors), tensor methods, and neural network methods. In my opinion, two important points in the study of image restoration are (1) a technical viewpoint on the behavior of mathematical models and optimization algorithms, and (2) a realistic viewpoint for digging into the essential nature of images. I hope this tutorial helps your study in the future. (Basics / Tensors / Neural Nets)