the limit of infinite network width
Contribution
・On CIFAR-10 and MNIST classification, the GP gave better results than the NN
・Proposes a way to compute the kernel function of the GP numerically
The reasons why I read this paper
・GPs overcome weaknesses of NNs
・GPs can be used in PyTorch and TensorFlow, and sped up on GPUs
Prolusion
relationship between NN and GP
The contents of the paper
・How to compute the kernel function numerically
・Experimental results
・Phase transition related to the hyperparameters of the GP
・Conclusion
Contents
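be estimated will increase in high dimension
RBFR → GPR
Avoid the curse of dimensionality by integrating with respect to $\mathbf{w}$
$$\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} = \begin{pmatrix} \phi_1(\mathbf{x}_1) & \cdots & \phi_H(\mathbf{x}_1) \\ \vdots & \ddots & \vdots \\ \phi_1(\mathbf{x}_N) & \cdots & \phi_H(\mathbf{x}_N) \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_H \end{pmatrix} = \Phi\mathbf{w}$$
We introduce the prior distribution $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \lambda^2 I)$.
In RBFR, the outputs for any set of inputs then have mean and covariance
$$E[\mathbf{y}] = \Phi E[\mathbf{w}] = \mathbf{0}$$
$$\Sigma = E[\mathbf{y}\mathbf{y}^T] - E[\mathbf{y}]E[\mathbf{y}]^T = \Phi E[\mathbf{w}\mathbf{w}^T]\Phi^T = \lambda^2 \Phi\Phi^T$$
Use it later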
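Avoid the curse of dimensionality by integrating with respect to $\mathbf{w}$
Because $\mathbf{w}$ follows a Gaussian, $\mathbf{y} = \Phi\mathbf{w}$ also follows a Gaussian:
$$E[\mathbf{y}] = E[\Phi\mathbf{w}] = \Phi E[\mathbf{w}] = \mathbf{0}$$
$$\Sigma = E[\mathbf{y}\mathbf{y}^T] - E[\mathbf{y}]E[\mathbf{y}]^T = E[(\Phi\mathbf{w})(\Phi\mathbf{w})^T] = \Phi E[\mathbf{w}\mathbf{w}^T]\Phi^T = \lambda^2 \Phi\Phi^T$$
$$\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \lambda^2 \Phi\Phi^T)$$
We do not have to know the weight vector $\mathbf{w}$, even if the dimension of the feature vector $\phi(\mathbf{x})$ (and hence of $\mathbf{w}$) is high.
Definition (Gaussian process): for any input set $X = (\mathbf{x}_1, \cdots, \mathbf{x}_N)$, the joint distribution $p(\mathbf{y})$ of the outputs $\mathbf{y} = (y_1, \cdots, y_N)$ follows a Gaussian $\overset{\text{def}}{\Longleftrightarrow}$ the relation between $\mathbf{x}$ and $\mathbf{y}$ follows a Gaussian process.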
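The covariance $\lambda^2 \Phi\Phi^T$ can be checked numerically. The following is a minimal sketch, assuming one-dimensional inputs and Gaussian basis functions $\phi_h(x) = \exp(-(x - c_h)^2 / \text{width}^2)$; the centers, width, and the names `rbf_features` and `prior_covariance` are hypothetical choices for illustration, not taken from the slides.

```python
import numpy as np

def rbf_features(x, centers, width=1.0):
    """Design matrix Phi with Phi[n, h] = exp(-(x_n - c_h)^2 / width^2)."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / width**2)

def prior_covariance(x, centers, lam, width=1.0):
    """Covariance lam^2 * Phi Phi^T of y = Phi w when w ~ N(0, lam^2 I)."""
    phi = rbf_features(x, centers, width)
    return lam**2 * (phi @ phi.T)

# Hypothetical 1-D inputs and basis-function centers, for illustration only.
x = np.linspace(-3.0, 3.0, 5)
centers = np.linspace(-4.0, 4.0, 9)
lam = 0.5
K = prior_covariance(x, centers, lam)

# Monte Carlo check: sample w explicitly and estimate E[y y^T].
rng = np.random.default_rng(0)
w = rng.normal(0.0, lam, size=(100_000, centers.size))
y = w @ rbf_features(x, centers).T   # each row is one sampled output vector
K_mc = y.T @ y / y.shape[0]
print(np.max(np.abs(K - K_mc)))      # small sampling error
```

The closed form never needs the sampled weights, which is the point of integrating out $\mathbf{w}$: only the N x N matrix over the inputs is required.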
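$$z_i^l(x) = b_i^l + \sum_{j=1}^{N_l} W_{ij}^l \, \phi(z_j^{l-1}(x))$$
In $N_l \to \infty$, it is equivalent to a GP because of the CLT
$$K^l(x, x') = E[z_i^l(x) z_i^l(x')] = \sigma_b^2 + \sigma_w^2 E[\phi(z_i^{l-1}(x)) \phi(z_i^{l-1}(x'))] = \sigma_b^2 + \sigma_w^2 F_\phi(K^{l-1}(x, x'), K^{l-1}(x, x), K^{l-1}(x', x'))$$
For general $\phi$ it is very difficult to get an analytical formula for $F_\phi$
When $\phi$ is ReLU [Cho & Saul 2009]
$$K^l(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{2\pi} \sqrt{K^{l-1}(x, x) K^{l-1}(x', x')} \left( \sin\theta_{x,x'}^{l-1} + (\pi - \theta_{x,x'}^{l-1}) \cos\theta_{x,x'}^{l-1} \right)$$
$$\theta_{x,x'}^{l} = \cos^{-1}\left( \frac{K^l(x, x')}{\sqrt{K^l(x, x) K^l(x', x')}} \right)$$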
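The ReLU recursion above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function name `relu_nngp_kernel` is hypothetical, and the base case $K^0(x, x') = \sigma_b^2 + \sigma_w^2 \, x \cdot x' / d_{in}$ is an assumption that does not appear on the slide.

```python
import numpy as np

def relu_nngp_kernel(X, depth, sigma_w2=2.0, sigma_b2=0.0):
    """Apply the ReLU kernel recursion `depth` times to the rows of X."""
    d_in = X.shape[1]
    # Assumed base case: K^0(x, x') = sigma_b^2 + sigma_w^2 * (x . x') / d_in
    K = sigma_b2 + sigma_w2 * (X @ X.T) / d_in
    for _ in range(depth):
        std = np.sqrt(np.diag(K))
        norm = np.outer(std, std)               # sqrt(K(x,x) K(x',x'))
        cos_theta = np.clip(K / norm, -1.0, 1.0)  # guard against rounding
        theta = np.arccos(cos_theta)
        K = sigma_b2 + sigma_w2 / (2.0 * np.pi) * norm * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta)
        )
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                     # hypothetical inputs
K0 = 2.0 * (X @ X.T) / X.shape[1]
K3 = relu_nngp_kernel(X, depth=3, sigma_w2=2.0, sigma_b2=0.0)
```

With $\sigma_w^2 = 2$ and $\sigma_b^2 = 0$ the diagonal is unchanged from layer to layer, because $\theta = 0$ gives $K^l(x, x) = K^{l-1}(x, x)$; this makes a convenient sanity check.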
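Cost of finding the kernel function corresponding to an L-layer NN
Proposed method: $O(n_g^2 n_v n_c + L(n_{\text{train}}^2 + n_{\text{train}} n_{\text{test}}))$
1. Generate the grids
$u = [-u_{\max}, \cdots, u_{\max}] \in \mathbb{R}^{n_g}$, each element placed at even intervals
$s = [0, \cdots, s_{\max}] \in \mathbb{R}^{n_v}$, each element placed at even intervals, $s_{\max} < u_{\max}^2$
$c = [-1, \cdots, 1] \in \mathbb{R}^{n_c}$, each element placed at even intervals
2. Fill the matrix $F$
$$F_{ij} = \frac{\sum_{a,b} \phi(u_a)\phi(u_b) \exp\left( -\frac{1}{2} \begin{pmatrix} u_a \\ u_b \end{pmatrix}^T \begin{pmatrix} s_i & s_i c_j \\ s_i c_j & s_i \end{pmatrix}^{-1} \begin{pmatrix} u_a \\ u_b \end{pmatrix} \right)}{\sum_{a,b} \exp\left( -\frac{1}{2} \begin{pmatrix} u_a \\ u_b \end{pmatrix}^T \begin{pmatrix} s_i & s_i c_j \\ s_i c_j & s_i \end{pmatrix}^{-1} \begin{pmatrix} u_a \\ u_b \end{pmatrix} \right)}$$
3. Approximate the function $F_\phi$ by bilinear interpolation into the matrix $F$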
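Steps 1 and 2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the grid sizes are small hypothetical values, the degenerate endpoints ($s = 0$ and $c = \pm 1$, where the 2x2 covariance is singular) are skipped, and the bilinear interpolation of step 3 is omitted. The table is compared against the closed form for ReLU.

```python
import numpy as np

def build_lookup_table(phi, u_max=10.0, n_g=201, s_max=4.0, n_v=5, n_c=9):
    """Tabulate F[i, j] ~ E[phi(a) phi(b)] for variance s_i and correlation c_j."""
    u = np.linspace(-u_max, u_max, n_g)
    s = np.linspace(0.0, s_max, n_v)[1:]      # skip s = 0 (degenerate)
    c = np.linspace(-1.0, 1.0, n_c)[1:-1]     # skip c = -1, 1 (singular)
    ua, ub = np.meshgrid(u, u, indexing="ij")
    prod = phi(ua) * phi(ub)
    F = np.empty((s.size, c.size))
    for i, s_i in enumerate(s):
        for j, c_j in enumerate(c):
            # Quadratic form with the inverse of [[s, s c], [s c, s]]
            quad = (ua**2 - 2.0 * c_j * ua * ub + ub**2) / (s_i * (1.0 - c_j**2))
            weight = np.exp(-0.5 * quad)
            F[i, j] = (prod * weight).sum() / weight.sum()
    return s, c, F

def relu_closed_form(s, c):
    """Closed form of E[relu(a) relu(b)] for variance s and correlation c."""
    theta = np.arccos(c)
    return s / (2.0 * np.pi) * (np.sin(theta) + (np.pi - theta) * c)

relu = lambda z: np.maximum(z, 0.0)
s, c, F = build_lookup_table(relu)
F_exact = relu_closed_form(s[:, None], c[None, :])
print(np.max(np.abs(F - F_exact)))            # discretization error
```

The table depends only on the nonlinearity and the grids, so it is built once and reused for every pair of data points and every layer, which is where the first term of the cost comes from.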
is MSE. No dropout.
Use the Google Vizier hyperparameter tuner to choose the initialization of the weights and biases.
How to classify by a regression method?
→ Return 0.9 for the right label and -0.1 for the wrong labels (the expectation is 0) [Rifkin & Klautau 2004]
Compare the results of NNs trained with SGD (Adam) and NNGPs on MNIST and CIFAR-10
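The target encoding for classification by regression can be sketched as below, assuming 10 classes as in MNIST and CIFAR-10. The helper names `encode_targets` and `decode_predictions` are hypothetical.

```python
import numpy as np

def encode_targets(labels, num_classes=10):
    """Regression targets: 0.9 for the right class, -0.1 for the others."""
    targets = np.full((len(labels), num_classes), -0.1)
    targets[np.arange(len(labels)), labels] = 0.9
    return targets

def decode_predictions(outputs):
    """Predicted class is the index of the largest regression output."""
    return np.argmax(outputs, axis=1)

labels = np.array([3, 0, 9])                  # hypothetical class labels
Y = encode_targets(labels)
print(Y.sum(axis=1))                          # each row sums to (about) zero
print(decode_predictions(Y))
```

With 10 classes each row sums to 0.9 - 9 x 0.1 = 0, which matches a zero-mean GP prior; for another number of classes the values -0.1 and 0.9 would no longer give a zero mean.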
・No optimization
・Due to its Bayesian nature, all predictions have uncertainty estimates
・Only matrix calculations
・No overfitting
The pros of increasing the number of units in an NN
・The generalization gap (= test error − train error) becomes smaller
distribution overcome it
・Some researchers are trying to find GPs that correspond to CNNs or LSTMs [2017 Mark van der Wilk] [2017 Maruan Al-Shedivat]
・The cost of GP regression is $O(N^3)$, where N is the number of training data.
→ There are many ways to reduce the cost (cf. inducing variable methods).
・There is a way to estimate a suitable kernel function automatically from the input data. [2011 Marc Peter Deisenroth]
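The $O(N^3)$ cost comes from factorizing the N x N training kernel matrix. A minimal sketch of GP regression, assuming an RBF kernel with unit variance and a small noise term (both hypothetical choices, not the NNGP kernel from the paper):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """k(a, b) = exp(-(a - b)^2 / (2 * length_scale^2)) for 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)

def gp_predict(x_train, y_train, x_test, noise_var=1e-2):
    """Posterior mean and variance of the latent function at x_test."""
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(x_train.size)
    K_s = rbf_kernel(x_test, x_train)
    L = np.linalg.cholesky(K)                 # the O(N^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = 1.0 - np.sum(v**2, axis=0)          # k(x, x) = 1 for this kernel
    return mean, var

x_train = np.linspace(0.0, 5.0, 20)           # hypothetical training data
y_train = np.sin(x_train)
mean, var = gp_predict(x_train, y_train, x_train)
```

There is no iterative optimization here, only matrix operations, and the variance comes for free alongside the mean. Methods that reduce the cost typically replace the full N x N factorization with a smaller one.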