Deep Neural Networks as Gaussian Processes

Deep Neural Networks as Gaussian Processes (ICLR 2018)
Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein

Introduction
Background
・A fully-connected NN is equivalent to a GP in the limit of infinite network width
Contribution
・GP classification of CIFAR-10 and MNIST gave better results than the NNs
・Proposes a way to compute the kernel function of the GP numerically
Why I read this paper
・GPs overcome some weaknesses of NNs
・We can use GPs in PyTorch and TensorFlow, with GPU speed-up

Contents
Background
・GP as extended linear regression
・Kernel trick
・The relationship between NN and GP
The contents of the paper
・How to calculate the kernel function numerically
・Experimental results
・Phase transition related to the hyperparameters of the GP
・Conclusion

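Linear regression
Regression by a linear combination of basis functions.
Basis functions: $\phi(x) = (\phi_1(x), \dots, \phi_H(x))^\top$
Weight vector: $w = (w_1, \dots, w_H)^\top$
Predicted value at any point $x$ (test): $y = w^\top \phi(x) = \sum_{h=1}^{H} w_h \phi_h(x)$
Design matrix: $\Phi = \begin{pmatrix} \phi_1(x_1) & \cdots & \phi_H(x_1) \\ \vdots & \ddots & \vdots \\ \phi_1(x_N) & \cdots & \phi_H(x_N) \end{pmatrix} \in \mathbb{R}^{N \times H}$
Training data: $D = \{(x_n, y_n) \mid n = 1, \dots, N\}$
Estimate from the training data set (train): $w = (\Phi^\top \Phi)^{-1} \Phi^\top y$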
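The least-squares estimate on this slide can be sketched in a few lines of NumPy. This is only an illustration: the basis $\phi(x) = (1, x, x^2)^\top$, the toy data and the helper name `design_matrix` are assumptions, not taken from the slides.

```python
import numpy as np

def design_matrix(x, basis_fns):
    """Stack basis functions column-wise: Phi[n, h] = phi_h(x_n)."""
    return np.column_stack([f(x) for f in basis_fns])

# Hypothetical basis phi(x) = (1, x, x^2), chosen only for illustration.
basis_fns = [lambda x: np.ones_like(x), lambda x: x, lambda x: x ** 2]

# Toy training data generated from y = 1 + 2x - 3x^2 (noise-free).
x_train = np.linspace(-1.0, 1.0, 20)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train ** 2

Phi = design_matrix(x_train, basis_fns)
# Least-squares solution; same result as (Phi^T Phi)^{-1} Phi^T y, but numerically safer.
w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)

# Prediction at a test point: y = w^T phi(x).
x_test = np.array([0.5])
y_pred = design_matrix(x_test, basis_fns) @ w
```

Because the toy data lie exactly in the span of the chosen basis, the fit recovers the generating weights; with a poorly chosen basis it would not, which is the weakness discussed next.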

Weakness of linear regression
$\phi(x) = (1, x, x^2)^\top$
$\phi(x) = (1, x, x^2, x^3, x^4)^\top$
$\phi(x) = (1, x, \sin x)^\top$
$\phi(x) = \;?$
We have to find proper basis functions manually.

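Radial basis function regression
$\phi_h(x) = \exp\left(-\frac{(x - \mu_h)^2}{\sigma^2}\right)$
$\phi(x) = (\phi_{-H}(x), \phi_{-H+1}(x), \dots, \phi_{H-1}(x), \phi_H(x))^\top$
$y = w^\top \phi(x)$
Weight vector: $w = (w_{-H}, \dots, w_H)^\top \in \mathbb{R}^{2H+1}$
Use shifted Gaussians as basis functions. Gaussians are expressive!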
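As a rough illustration of how expressive shifted Gaussians are, the sketch below fits $\sin x$ with $2H+1 = 11$ radial basis functions. The centers $\mu_h = h$, the width $\sigma = 1$ and the target function are assumptions made for this example only.

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """Phi[n, h] = exp(-(x_n - mu_h)^2 / sigma^2)."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / sigma ** 2)

H = 5
centers = np.arange(-H, H + 1, dtype=float)   # assumed centers mu_h = h, h = -H..H
sigma = 1.0                                   # assumed width

x_train = np.linspace(-4.0, 4.0, 50)
y_train = np.sin(x_train)                     # toy target, not from the slides

Phi = rbf_features(x_train, centers, sigma)   # shape (50, 2H + 1)
w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)

# Root-mean-square error of the fit on the training inputs.
rms_error = np.sqrt(np.mean((Phi @ w - y_train) ** 2))
```

No hand-picked polynomial or trigonometric basis is needed here; the price is that the number of centers grows quickly with the input dimension.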

Radial basis function regression
Number of basis functions: 10, $\sigma = 1.0$. RBFs are expressive.
In an $n$-dimensional problem, $10^n$ RBFs are needed as a basis, so a $10^n$-dimensional weight vector has to be estimated: the curse of dimensionality.

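Derivation of Gaussian process
The number of parameters to be estimated increases in high dimensions.
RBFR → GPR: avoid the curse of dimensionality by integrating out $w$.
$y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} = \begin{pmatrix} \phi_1(x_1) & \cdots & \phi_H(x_1) \\ \vdots & \ddots & \vdots \\ \phi_1(x_N) & \cdots & \phi_H(x_N) \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_H \end{pmatrix} = \Phi w$
We introduce a prior distribution $w \sim \mathcal{N}(0, \lambda^2 I)$.
In RBFR, the output for any input is $y = w^\top \phi(x)$.
$\mathbb{E}[w w^\top] - \mathbb{E}[w]\mathbb{E}[w]^\top = \mathrm{Cov}[w] = \lambda^2 I$ (used later)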

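Derivation of Gaussian process
GPR: $y = \Phi w$, $w \sim \mathcal{N}(0, \lambda^2 I)$
Because $w$ follows a Gaussian, $y$ follows a Gaussian.
$\mathbb{E}[y] = \mathbb{E}[\Phi w] = \Phi\,\mathbb{E}[w] = 0$
$\Sigma = \mathbb{E}[y y^\top] - \mathbb{E}[y]\mathbb{E}[y]^\top = \mathbb{E}[(\Phi w)(\Phi w)^\top] = \Phi\,\mathbb{E}[w w^\top]\,\Phi^\top = \lambda^2 \Phi \Phi^\top$, using $\mathbb{E}[w w^\top] - \mathbb{E}[w]\mathbb{E}[w]^\top = \mathrm{Cov}[w] = \lambda^2 I$
$y \sim \mathcal{N}(0, \lambda^2 \Phi \Phi^\top)$
We don't have to know the weight vector $w$, even if the dimension of $x$ is high: the curse of dimensionality is avoided by integrating out $w$.
Definition (Gaussian process): if, for any input set $(x_1, \dots, x_N)$, the joint distribution $p(y)$ of the outputs $y = (y_1, \dots, y_N)$ is Gaussian, then the relation between $x$ and $y$ follows a Gaussian process.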
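The claim that $y = \Phi w$ with $w \sim \mathcal{N}(0, \lambda^2 I)$ has covariance $\lambda^2 \Phi \Phi^\top$ can be checked by sampling. The sketch below is a Monte Carlo sanity check under assumed values (three inputs, seven RBF centers, $\lambda = 0.5$), none of which come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5                                   # prior standard deviation lambda (arbitrary)
x = np.array([-1.0, 0.0, 2.0])              # N = 3 inputs (arbitrary)
centers = np.arange(-3, 4, dtype=float)     # 7 hypothetical RBF centers
Phi = np.exp(-((x[:, None] - centers[None, :]) ** 2))   # design matrix, shape (3, 7)

# Draw many weight vectors w ~ N(0, lambda^2 I) and push them through y = Phi w.
num_samples = 200_000
W = lam * rng.standard_normal((num_samples, centers.size))
Y = W @ Phi.T                               # each row is one sampled output vector y

empirical_cov = (Y.T @ Y) / num_samples     # estimates E[y y^T]; E[y] = 0
analytic_cov = lam ** 2 * Phi @ Phi.T       # lambda^2 Phi Phi^T
max_abs_diff = np.max(np.abs(empirical_cov - analytic_cov))
```

The weights appear only inside the sampling step; the analytic covariance needs nothing but $\Phi \Phi^\top$, which is what makes the kernel view possible.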

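Kernel trick
$y \sim \mathcal{N}(0, \lambda^2 \Phi \Phi^\top) \equiv \mathcal{N}(0, K)$, $K$: kernel matrix
$K_{n,n'} = \lambda^2 \phi(x_n)^\top \phi(x_{n'}) \equiv k(x_n, x_{n'})$: inner product of the basis functions $\phi(x) = (\phi_1(x), \dots, \phi_H(x))^\top$ (how similar $x_n$ and $x_{n'}$ look)
Example: RBF kernel
Put RBFs with centers $h/H$ at even $1/H$ intervals in the range $h/H \in [-H, H]$:
$\phi_h(x) = \exp\left(-\frac{(x - h/H)^2}{\sigma^2}\right)$
$\phi(x) = (\phi_{-H^2}(x), \phi_{-H^2+1}(x), \dots, \phi_{H^2-1}(x), \phi_{H^2}(x))^\top$
$k(x, x') = \lambda^2 \phi(x)^\top \phi(x') = \lambda^2 \sum_{h=-H^2}^{H^2} \phi_h(x)\,\phi_h(x')$
In the limit $H \to \infty$ the sum, scaled by the grid spacing $1/H$, becomes an integral:
$k(x, x') = \int_{-\infty}^{\infty} \lambda^2 \exp\left(-\frac{(x - h)^2}{\sigma^2}\right) \exp\left(-\frac{(x' - h)^2}{\sigma^2}\right) dh = \theta_1 \exp\left(-\frac{1}{\theta_2}(x - x')^2\right)$
$\theta_1 = \lambda^2 \sqrt{\frac{\pi \sigma^2}{2}}$, $\theta_2 = 2\sigma^2$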
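The passage from a finite sum of RBF products to a closed-form kernel can be verified numerically. The sketch below assumes $\lambda = 1$ and $\sigma = 1$, writes the RBF width as `sigma`, and multiplies the sum by the grid spacing $1/H$ so that it converges to the integral; the function names are hypothetical.

```python
import numpy as np

def finite_kernel(x, x_prime, H, sigma=1.0, lam=1.0):
    """lambda^2 * (1/H) * sum_h phi_h(x) phi_h(x'), RBF centers h/H for h = -H^2..H^2."""
    centers = np.arange(-H * H, H * H + 1) / H
    phi_x = np.exp(-((x - centers) ** 2) / sigma ** 2)
    phi_xp = np.exp(-((x_prime - centers) ** 2) / sigma ** 2)
    return lam ** 2 * np.sum(phi_x * phi_xp) / H    # 1/H is the grid spacing

def limit_kernel(x, x_prime, sigma=1.0, lam=1.0):
    """theta_1 * exp(-(x - x')^2 / theta_2), the H -> infinity limit."""
    theta_1 = lam ** 2 * np.sqrt(np.pi * sigma ** 2 / 2.0)
    theta_2 = 2.0 * sigma ** 2
    return theta_1 * np.exp(-((x - x_prime) ** 2) / theta_2)

k_finite = finite_kernel(0.3, -0.8, H=50)   # 5001 basis functions
k_limit = limit_kernel(0.3, -0.8)           # no basis functions at all
```

The closed form depends only on $x - x'$, so the infinitely many basis functions never have to be evaluated.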

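How to do regression with a GP?
$y \sim \mathcal{N}(0, K)$, $K_{n,n'} \equiv k(x_n, x_{n'}) = \theta_1 \exp\left(-\frac{1}{\theta_2}(x_n - x_{n'})^2\right)$
How do we predict the unknown value $y^*$ corresponding to $x^*$?
$\begin{pmatrix} y \\ y^* \end{pmatrix} \sim \mathcal{N}\left(0, \begin{pmatrix} K & k_* \\ k_*^\top & k_{**} \end{pmatrix}\right)$
$k_* = (k(x^*, x_1), \dots, k(x^*, x_N))^\top$
$k_{**} = k(x^*, x^*)$
$p(y^* \mid x^*, D) = \mathcal{N}\left(k_*^\top K^{-1} y,\; k_{**} - k_*^\top K^{-1} k_*\right)$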
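The predictive distribution can be computed with two linear solves. The following is a minimal sketch for 1-D inputs, assuming noise-free observations, $\theta_1 = \theta_2 = 1$ and toy data; the small `jitter` added to the diagonal of $K$ is a numerical-stability assumption that does not appear in the formula.

```python
import numpy as np

def rbf_kernel(a, b, theta_1=1.0, theta_2=1.0):
    """k(x, x') = theta_1 * exp(-(x - x')^2 / theta_2) for 1-D inputs."""
    return theta_1 * np.exp(-((a[:, None] - b[None, :]) ** 2) / theta_2)

def gp_predict(x_train, y_train, x_test, jitter=1e-8):
    """Mean and variance of p(y* | x*, D) for noise-free observations."""
    K = rbf_kernel(x_train, x_train) + jitter * np.eye(x_train.size)
    k_star = rbf_kernel(x_train, x_test)                 # shape (N, M)
    k_star_star = rbf_kernel(x_test, x_test).diagonal()  # k(x*, x*) per test point
    alpha = np.linalg.solve(K, y_train)                  # K^{-1} y
    v = np.linalg.solve(K, k_star)                       # K^{-1} k_*
    mean = k_star.T @ alpha                              # k_*^T K^{-1} y
    var = k_star_star - np.sum(k_star * v, axis=0)       # k_** - k_*^T K^{-1} k_*
    return mean, var

x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = np.sin(x_train)                                # toy data
mean, var = gp_predict(x_train, y_train, np.array([0.0, 10.0]))
```

At a training input the variance collapses to roughly zero, while far from the data it returns to the prior value $\theta_1$; this is the uncertainty estimate that a point-estimate regression does not provide.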

The result

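Relationship between NN and GP
[Figure: fully-connected network with one hidden layer]
Input: $x = (x_1, \dots, x_{N_0})$; hidden activations: $x^1 = (x^1_1, \dots, x^1_{N_1})$
$W^0_{jk} \sim \mathcal{N}(0, \sigma_w^2 / N_0)$, $W^1_{ij} \sim \mathcal{N}(0, \sigma_w^2 / N_1)$
$z^1_i(x) = \sum_{j=1}^{N_1} W^1_{ij}\, x^1_j(x) = \sum_{j=1}^{N_1} W^1_{ij}\, \phi\left(\sum_{k=1}^{N_0} W^0_{jk} x_k\right)$, $\phi$: activation function
As $N_1 \to \infty$, $z^1_i$ follows a Gaussian because of the central limit theorem ($N_1$: number of units in the hidden layer).
A fully-connected NN is equivalent to a GP in the limit of infinite network width [Neal 1994] [Williams 1997].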
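The central-limit argument can be illustrated by sampling many random networks and measuring how Gaussian their outputs look. The sketch below assumes a bias-free one-hidden-layer ReLU network with weight variances $\sigma_w^2 / N$ and uses excess kurtosis (zero for a Gaussian) as a crude Gaussianity measure; the input, widths and sample counts are arbitrary.

```python
import numpy as np

def sample_outputs(x, width, num_networks, sigma_w2=1.0, seed=0):
    """Sample z^1(x) from `num_networks` random one-hidden-layer ReLU networks."""
    rng = np.random.default_rng(seed)
    n0 = x.size
    W0 = rng.normal(0.0, np.sqrt(sigma_w2 / n0), size=(num_networks, width, n0))
    W1 = rng.normal(0.0, np.sqrt(sigma_w2 / width), size=(num_networks, width))
    hidden = np.maximum(W0 @ x, 0.0)        # ReLU activations, shape (num_networks, width)
    return np.sum(W1 * hidden, axis=1)      # one scalar output per sampled network

def excess_kurtosis(z):
    """Fourth standardized moment minus 3; zero for an exact Gaussian."""
    z = z - z.mean()
    return np.mean(z ** 4) / np.mean(z ** 2) ** 2 - 3.0

x = np.array([1.0, -0.5, 2.0])
kurt_narrow = excess_kurtosis(sample_outputs(x, width=1, num_networks=5000))
kurt_wide = excess_kurtosis(sample_outputs(x, width=256, num_networks=5000))
```

With a single hidden unit the output distribution is strongly non-Gaussian, while at width 256 the excess kurtosis is already close to zero.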

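Relationship between NN and GP
$z^1_i \sim \mathcal{GP}(\mu^1, K^1)$, with $W^0_{jk} \sim \mathcal{N}(0, \sigma_w^2 / N_0)$ and $W^1_{ij} \sim \mathcal{N}(0, \sigma_w^2 / N_1)$
$\mu^1(x) = \mathbb{E}[z^1_i(x)] = \sum_j \mathbb{E}[W^1_{ij}\, x^1_j(x)] = \sum_j \mathbb{E}[W^1_{ij}]\, \mathbb{E}[x^1_j(x)] = 0$
$K^1(x, x') = \mathbb{E}[z^1_i(x) z^1_i(x')] - \mathbb{E}[z^1_i(x)]\,\mathbb{E}[z^1_i(x')] = \mathbb{E}[z^1_i(x) z^1_i(x')] = \sigma_w^2\, \mathbb{E}\left[\phi\left(\sum_{k=1}^{N_0} W^0_{jk} x_k\right) \phi\left(\sum_{k=1}^{N_0} W^0_{jk} x'_k\right)\right]$
It is very difficult to get an analytical formula for a general $\phi$.
When $\phi$ is ReLU: $K^1(x, x') = \frac{\sigma_w^4}{2\pi N_0} \|x\| \|x'\| \left(\sin\theta + (\pi - \theta)\cos\theta\right)$
When $\phi$ is the step function: $K^1(x, x') = \frac{\sigma_w^2}{2\pi} (\pi - \theta)$
$\theta = \cos^{-1}\left(\frac{x \cdot x'}{\|x\| \|x'\|}\right)$ [Cho & Saul 2009, NIPS, Kernel Methods for Deep Learning]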
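The ReLU closed form can be compared against a direct Monte Carlo estimate of $\sigma_w^2\, \mathbb{E}[\phi(\cdot)\phi(\cdot)]$. The sketch below assumes the bias-free setting with first-layer weights of variance $\sigma_w^2 / N_0$; the inputs, $\sigma_w^2 = 1.6$ and the function names are illustrative choices.

```python
import numpy as np

def relu_kernel_analytic(x, x_prime, sigma_w2=1.0):
    """K^1(x, x') for a bias-free ReLU layer, weight variances sigma_w2 / N."""
    n0 = x.size
    norm_prod = np.linalg.norm(x) * np.linalg.norm(x_prime)
    theta = np.arccos(np.clip(x @ x_prime / norm_prod, -1.0, 1.0))
    angular = np.sin(theta) + (np.pi - theta) * np.cos(theta)
    return sigma_w2 ** 2 / (2.0 * np.pi * n0) * norm_prod * angular

def relu_kernel_monte_carlo(x, x_prime, sigma_w2=1.0, num_samples=400_000, seed=0):
    """sigma_w2 * E[relu(w.x) relu(w.x')] with w ~ N(0, sigma_w2 / N_0 * I)."""
    rng = np.random.default_rng(seed)
    n0 = x.size
    w = rng.normal(0.0, np.sqrt(sigma_w2 / n0), size=(num_samples, n0))
    products = np.maximum(w @ x, 0.0) * np.maximum(w @ x_prime, 0.0)
    return sigma_w2 * np.mean(products)

x = np.array([1.0, 0.0, 1.0])
x_prime = np.array([0.5, 1.0, 0.5])
k_exact = relu_kernel_analytic(x, x_prime, sigma_w2=1.6)
k_mc = relu_kernel_monte_carlo(x, x_prime, sigma_w2=1.6)
```

Each row of `w` plays the role of one hidden unit's incoming weights, so averaging over rows mimics the infinite-width limit.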

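Extension to multi-layer NNs
$z^l_i(x) = \sum_{j=1}^{N_l} W^l_{ij}\, x^l_j(x) = \sum_{j=1}^{N_l} W^l_{ij}\, \phi\left(z^{l-1}_j(x)\right)$
As $N_l \to \infty$, $z^l_i$ is equivalent to a GP because of the CLT.
$K^l(x, x') = \mathbb{E}[z^l_i(x) z^l_i(x')] = \sigma_w^2\, \mathbb{E}\left[\phi(z^{l-1}_j(x))\, \phi(z^{l-1}_j(x'))\right] = \sigma_w^2\, F_\phi\left(K^{l-1}(x, x'), K^{l-1}(x, x), K^{l-1}(x', x')\right)$
It is very difficult to get an analytical formula for $F_\phi$ in general.
When $\phi$ is ReLU [Cho & Saul 2009]:
$K^l(x, x') = \frac{\sigma_w^2}{2\pi} \sqrt{K^{l-1}(x, x)\, K^{l-1}(x', x')} \left(\sin\theta^{l-1}_{x,x'} + (\pi - \theta^{l-1}_{x,x'}) \cos\theta^{l-1}_{x,x'}\right)$
$\theta^{l-1}_{x,x'} = \cos^{-1}\left(\frac{K^{l-1}(x, x')}{\sqrt{K^{l-1}(x, x)\, K^{l-1}(x', x')}}\right)$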
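For ReLU the layer-by-layer recursion is short enough to write out directly. This is a sketch of the bias-free recursion only (the paper's kernel also carries a bias-variance term, omitted here), with $K^0(x, x') = \sigma_w^2\, x \cdot x' / N_0$ assumed as the base case; the function name and inputs are hypothetical.

```python
import numpy as np

def nngp_relu_kernel(X, depth, sigma_w2=2.0):
    """K^depth for the rows of X, using the bias-free ReLU recursion."""
    n0 = X.shape[1]
    K = sigma_w2 * (X @ X.T) / n0                   # K^0: covariance of first pre-activations
    for _ in range(depth):
        diag = np.sqrt(np.diag(K))
        norm = np.outer(diag, diag)                 # sqrt(K(x, x) K(x', x'))
        theta = np.arccos(np.clip(K / norm, -1.0, 1.0))
        angular = np.sin(theta) + (np.pi - theta) * np.cos(theta)
        K = sigma_w2 / (2.0 * np.pi) * norm * angular
    return K

X = np.array([[1.0, 0.0], [0.6, 0.8], [-1.0, 0.5]])
K0 = nngp_relu_kernel(X, depth=0)
K3 = nngp_relu_kernel(X, depth=3)   # kernel of a network with three ReLU layers
```

With $\sigma_w^2 = 2$ the diagonal $K^l(x, x)$ is preserved from layer to layer, because $\theta = 0$ gives $K^l(x, x) = \frac{\sigma_w^2}{2} K^{l-1}(x, x)$; other values make it shrink or grow geometrically with depth.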

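How to find the kernel function numerically
Cost of finding the kernel function corresponding to an $L$-layer NN:
・Existing methods: $O\left(n_g^2\, L\, (n_{train}^2 + n_{train} n_{test})\right)$
・Proposed method: $O\left(n_g^2\, n_v\, n_c + L\, (n_{train}^2 + n_{train} n_{test})\right)$
1. Generate grids whose elements are placed at even intervals: pre-activations $u = [-u_{max}, \dots, u_{max}] \in \mathbb{R}^{n_g}$, variances $s = [0, \dots, s_{max}] \in \mathbb{R}^{n_v}$ with $s_{max} < u_{max}^2$, and correlations $c = (-1, \dots, 1) \in \mathbb{R}^{n_c}$.
2. Fill the matrix $F$: $F_{ij} = \frac{\sum_{a,b} \phi(u_a)\, \phi(u_b) \exp\left(-\frac{1}{2} (u_a\ u_b)\, \Sigma_{ij}^{-1} (u_a\ u_b)^\top\right)}{\sum_{a,b} \exp\left(-\frac{1}{2} (u_a\ u_b)\, \Sigma_{ij}^{-1} (u_a\ u_b)^\top\right)}$, $\Sigma_{ij} = \begin{pmatrix} s_i & s_i c_j \\ s_i c_j & s_i \end{pmatrix}$
3. Approximate the function $F_\phi$ by bilinear interpolation into the matrix $F$.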
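The three steps can be sketched as a lookup table plus an interpolation routine. The sketch below makes several simplifying assumptions that are not in the slides: the variance grid starts at 0.5 and the correlation grid stops at ±0.9 (so that $\Sigma$ stays invertible), the grid sizes are small, and ReLU is used so the table can be compared with the closed form. All function names are hypothetical.

```python
import numpy as np

def build_lookup_table(phi, s_grid, c_grid, u_max=10.0, n_g=601):
    """F[i, j] approximates E[phi(u_a) phi(u_b)] for variance s_i and correlation c_j."""
    u = np.linspace(-u_max, u_max, n_g)
    ua, ub = np.meshgrid(u, u, indexing="ij")
    phi_prod = phi(ua) * phi(ub)
    F = np.empty((s_grid.size, c_grid.size))
    for i, s in enumerate(s_grid):
        for j, c in enumerate(c_grid):
            # Quadratic form for Sigma = s * [[1, c], [c, 1]], inverted in closed form.
            quad = (ua ** 2 - 2.0 * c * ua * ub + ub ** 2) / (s * (1.0 - c ** 2))
            weights = np.exp(-0.5 * quad)
            F[i, j] = np.sum(phi_prod * weights) / np.sum(weights)
    return F

def bilinear_lookup(F, s_grid, c_grid, s, c):
    """Bilinear interpolation into F at an off-grid point (s, c)."""
    i = np.clip(np.searchsorted(s_grid, s) - 1, 0, s_grid.size - 2)
    j = np.clip(np.searchsorted(c_grid, c) - 1, 0, c_grid.size - 2)
    ts = (s - s_grid[i]) / (s_grid[i + 1] - s_grid[i])
    tc = (c - c_grid[j]) / (c_grid[j + 1] - c_grid[j])
    return ((1 - ts) * (1 - tc) * F[i, j] + ts * (1 - tc) * F[i + 1, j]
            + (1 - ts) * tc * F[i, j + 1] + ts * tc * F[i + 1, j + 1])

def relu_expectation(s, c):
    """Closed form of E[relu(u_a) relu(u_b)], used only for comparison."""
    theta = np.arccos(c)
    return s / (2.0 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

relu = lambda u: np.maximum(u, 0.0)
s_grid = np.linspace(0.5, 4.0, 4)       # assumed variance grid
c_grid = np.linspace(-0.9, 0.9, 19)     # assumed correlation grid
F = build_lookup_table(relu, s_grid, c_grid)
approx = bilinear_lookup(F, s_grid, c_grid, s=1.3, c=0.45)
exact = relu_expectation(1.3, 0.45)
```

The table is built once per activation function and is independent of the data, which is where the saving over per-pair numerical integration comes from; swapping `relu` for another activation such as `np.tanh` requires no other change to `build_lookup_table`.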

Numerical calculation of the kernel
For ReLU, the numerically computed kernel is compared with the analytical solution.

Experimental details
・Activation function is ReLU or tanh.
・Loss function is MSE.
・No dropout.
・Use the Google Vizier hyperparameter tuner to choose the initialization of the weights and biases.
・How to classify with a regression method? → Use a regression target of 0.9 for the correct label and -0.1 for the wrong labels (the expectation is 0) [Rifkin & Klautau 2004]
・Compare the results of NNs trained with SGD (Adam) and NNGPs on MNIST and CIFAR-10.
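The classification-by-regression encoding is simple to write down. The sketch below assumes 10 classes, as in MNIST and CIFAR-10; the helper names are hypothetical.

```python
import numpy as np

def encode_labels(labels, num_classes=10):
    """Regression targets: 0.9 for the correct class, -0.1 for every other class."""
    targets = np.full((labels.size, num_classes), -0.1)
    targets[np.arange(labels.size), labels] = 0.9
    return targets

def decode_predictions(predicted_targets):
    """Pick the class whose regression output is largest."""
    return np.argmax(predicted_targets, axis=1)

labels = np.array([3, 0, 9])
targets = encode_labels(labels)          # shape (3, 10)
recovered = decode_predictions(targets)
```

With 10 classes each target row sums to zero, which matches the zero-mean GP prior; in practice `decode_predictions` would be applied to the GP predictive means, one regression output per class.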

Experimental Result

A comparison of NN versus GP
The advantages of GP
・No optimization
・Due to its Bayesian nature, all predictions have uncertainty estimates
・Only matrix calculation
・No overfitting
The pros of increasing the number of units in an NN
・The generalization gap (= test error − train error) becomes smaller

Phase transition related to hyperparameters of GP
Theory [Schoenholz 2017] / Experimental result
A phase transition occurs depending on the variances of the prior distributions of the weights and biases, $\sigma_w^2$ and $\sigma_b^2$.

Discussion
・A disadvantage of GPs is their sensitivity to outliers → a Student-t distribution overcomes it
・Some researchers are trying to find GPs that correspond to CNNs or LSTMs [2017 Mark van der Wilk] [2017 Maruan Al-Shedivat]
・The cost of GP regression is $O(N^3)$, where $N$ is the number of training data points → there are many ways to reduce the cost (cf. the inducing variable method)
・There is a way to estimate a suitable kernel function automatically from the input data [2011 Marc Peter Deisenroth]


Kazu Ghalamkari
