Natural Gradient Learning by Shun'ichi Amari, RIKEN - RRS #1

Transcript

  1. Optimization Problem

    Parameter: $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$. Cost (loss) function: $L(\theta)$.
    Minimize $L$: $\theta^* = \arg\min_\theta L(\theta)$
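  2. What is gradient: steepest direction

    $\nabla L(\theta) = \left( \frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2}, \ldots, \frac{\partial L}{\partial \theta_n} \right)$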
  3. On-line gradient learning

    Model: $y = f(x, \theta)$
    Instantaneous loss function: $l(x, \theta) = \frac{1}{2} \{ y^*(x) - f(x, \theta) \}^2$
    $L(\theta) = E[l(x, \theta)]$
    $\theta_{t+1} = \theta_t - \eta_t \nabla l(x_t, \theta_t)$
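The on-line rule above can be sketched as follows. This is a minimal illustration, not code from the slides: it assumes a linear model $f(x, \theta) = \theta \cdot x$, a noiseless teacher $y^*(x) = \theta^* \cdot x$, and a constant learning rate; the names `online_gradient_step` and `theta_true` are hypothetical.

```python
import numpy as np

def online_gradient_step(theta, x, y_star, eta):
    """One step theta <- theta - eta * grad l, with l = 0.5 * (y* - f(x, theta))**2."""
    residual = y_star - theta @ x      # y*(x) - f(x, theta) for the assumed linear f
    grad = -residual * x               # gradient of the instantaneous loss w.r.t. theta
    return theta - eta * grad

rng = np.random.default_rng(0)
theta_true = np.array([1.5, -0.5])     # hypothetical teacher parameters
theta = np.zeros(2)
for _ in range(2000):
    x = rng.normal(size=2)             # one training input per step
    theta = online_gradient_step(theta, x, theta_true @ x, eta=0.05)
```

Each step uses a single sample, so the update follows the instantaneous loss $l$ rather than the expected loss $L$.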
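  4. Riemannian manifold

    $ds^2 = |d\theta|^2 = \sum_{i,j} g_{ij}\, d\theta_i\, d\theta_j = d\theta^T G\, d\theta$
    $G(\theta) = (g_{ij}(\theta))$
    [Figure: coordinate axes $\theta_i$, $\theta_j$ with a small displacement $d\theta$]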
  5. Manifold of probability distributions

    Parameterized model: $M = \{ p(x, \theta) \}$
    Gaussian: $\theta = (\mu, \sigma)$, $p(x, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}$
    Discrete $x$: probability simplex
  6. Information Geometry?

    $S = \{ p(x; \mu, \sigma) \}$, $p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}$
    Gaussian distributions: $S = \{ p(x; \theta) \}$, $\theta = (\mu, \sigma)$
  7. Manifold of Probability Distributions

    $x = 1, 2, 3$: $\{ p(x) \}$, $p = (p_1, p_2, p_3)$, $p_1 + p_2 + p_3 = 1$
    $M = \{ p(x; \theta) \}$
    [Figure: probability simplex with axes $p_1$, $p_2$, $p_3$ and a point $p$]
  8. Fisher information matrix and Riemannian metric

    $G(\theta) = (g_{ij}(\theta)) = E[\nabla \log p(x, \theta)\, \nabla \log p(x, \theta)^T]$
    Estimation error: $E[(\hat\theta_i - \theta_i)(\hat\theta_j - \theta_j)] \approx \frac{1}{T} (G^{-1})_{ij}$
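As a numerical illustration of the Fisher information matrix, the sketch below estimates $E[\nabla \log p\, \nabla \log p^T]$ by sampling for a Gaussian with $\theta = (\mu, \sigma)$ and compares it with the closed form $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$ for that parametrization. The function names, sample size, and parameter values are assumptions chosen for the example.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Gradient of log p(x; mu, sigma) w.r.t. (mu, sigma), one row per sample."""
    d_mu = (x - mu) / sigma**2
    d_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma**3
    return np.stack([d_mu, d_sigma], axis=1)

def empirical_fisher(x, mu, sigma):
    """Sample average of the outer product of the score vector."""
    s = gaussian_score(x, mu, sigma)
    return s.T @ s / len(x)

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0                                  # hypothetical true parameters
samples = rng.normal(mu, sigma, size=200_000)
G_hat = empirical_fisher(samples, mu, sigma)
G_exact = np.diag([1 / sigma**2, 2 / sigma**2])       # closed form for theta = (mu, sigma)
```

The Monte Carlo estimate `G_hat` should agree with `G_exact` up to sampling noise.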
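  9. Minimizing cost in Riemannian manifold

    Minimize $L(\theta + a) \approx L(\theta) + \nabla L(\theta)^T a$ subject to $|a|^2 = a^T G a = 1$.
    Stationarity of $\{ \nabla L(\theta)^T a - \lambda\, a^T G a \}$ in $a$ gives $a \propto -G^{-1}(\theta)\, \nabla L(\theta)$.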
  10. Steepest Direction --- Natural Gradient

    $\nabla l = \left( \frac{\partial l}{\partial \theta_1}, \ldots, \frac{\partial l}{\partial \theta_n} \right)$, $\tilde\nabla l = G^{-1} \nabla l$
    $|d\theta|^2 = d\theta^T G\, d\theta = \sum_{i,j} G_{ij}\, d\theta_i\, d\theta_j$
    $\Delta\theta_t = -\eta_t \tilde\nabla l(x_t, y_t; \theta_t)$
  11. Natural Gradient Learning

    $\theta_{t+1} = \theta_t - \eta_t \tilde\nabla l(x_t, \theta_t)$, $\tilde\nabla l = G^{-1} \nabla l$
    $\Delta\theta_t = -\eta_t \tilde\nabla l(x_t, y^*_t; \theta_t)$
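A minimal sketch of the natural gradient update, under assumptions not taken from the slides: the model is a Gaussian with $\theta = (\mu, \sigma)$, the loss is $l = -\log p(x; \theta)$, the inverse Fisher matrix is the closed form $\mathrm{diag}(\sigma^2, \sigma^2/2)$, and the learning rate decays as $1/(t + 10)$. The data source and function names are hypothetical.

```python
import numpy as np

def nll_grad(x, mu, sigma):
    """Gradient of l = -log p(x; mu, sigma) with respect to (mu, sigma)."""
    return np.array([-(x - mu) / sigma**2,
                     1.0 / sigma - (x - mu) ** 2 / sigma**3])

def natural_gradient_step(theta, x, eta):
    """theta <- theta - eta * G^{-1} grad l, using the Gaussian's inverse Fisher matrix."""
    mu, sigma = theta
    G_inv = np.diag([sigma**2, sigma**2 / 2.0])
    return theta - eta * G_inv @ nll_grad(x, mu, sigma)

rng = np.random.default_rng(1)
theta = np.array([0.0, 1.0])                      # initial (mu, sigma)
for t in range(1, 20001):
    x = rng.normal(3.0, 0.5)                      # hypothetical data source
    theta = natural_gradient_step(theta, x, eta=1.0 / (t + 10))
```

With this metric the update for $\mu$ reduces to $\mu \leftarrow \mu + \eta (x - \mu)$, so the step size no longer depends on the scale $\sigma$ of the data.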
  12. Information Geometry of MLP

    Adaptive Natural Gradient Learning:
    $\Delta\theta_t = -\eta_t G_t^{-1} \nabla l$
    $G_{t+1}^{-1} = (1 + \varepsilon_t)\, G_t^{-1} - \varepsilon_t\, G_t^{-1} \nabla f\, (\nabla f)^T G_t^{-1}$
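The adaptive recursion for $G^{-1}$ can be sketched in isolation. In the sketch below, synthetic Gaussian vectors stand in for the network output gradients $\nabla f$, and the step size $\varepsilon_t = 1/(t + 100)$ is an assumed schedule; neither comes from the slides. The point is only that the recursion tracks the inverse of $E[\nabla f\, \nabla f^T]$ without ever inverting a matrix.

```python
import numpy as np

def adaptive_inverse_update(G_inv, grad_f, eps):
    """Return (1 + eps) * G_inv - eps * G_inv grad_f grad_f^T G_inv."""
    v = G_inv @ grad_f                     # G_inv stays symmetric, so grad_f^T G_inv = v^T
    return (1.0 + eps) * G_inv - eps * np.outer(v, v)

rng = np.random.default_rng(0)
C = np.array([[2.0, 0.5], [0.5, 1.0]])     # hypothetical E[grad_f grad_f^T]
grads = rng.multivariate_normal(np.zeros(2), C, size=20000)

G_inv = np.eye(2)                          # initial guess for the inverse metric
for t, g in enumerate(grads, start=1):
    G_inv = adaptive_inverse_update(G_inv, g, eps=1.0 / (t + 100))
```

After the loop, `G_inv` should be close to `np.linalg.inv(C)`; each update costs only a matrix-vector product and an outer product.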
  13. Independent Component Analysis

    $x_i = \sum_j A_{ij} s_j$, $x = As$; $y = Wx$, $W = A^{-1}$
    [Diagram: $s \to A \to x \to W \to y$]
    Observations: $x(1), x(2), \ldots, x(t)$. Recover: $s(1), s(2), \ldots, s(t)$.
  14. Mixture and unmixture of independent signals

    $x_i = \sum_{j=1}^{n} A_{ij} s_j$, $x = As$
    [Figure: sources $s_1, s_2, \ldots, s_n$ mixed into observations $x_1, x_2, \ldots, x_m$]
  15. $x = As$, $y = Wx$: $W = A^{-1}$

    $A$: unknown matrix; $s$: unknown; $r(s)$: unknown
    Observations: $x(1), x(2), \ldots, x(t)$
    Independent distribution: $r(s) = r_1(s_1)\, r_2(s_2) \cdots r_n(s_n)$, $E[s] = 0$
    Cost function $l(y, W)$: degree of non-independence; $\Delta W = -\eta\, \frac{\partial l(y, W)}{\partial W}$
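  16. Riemannian manifold

    $ds^2 = |d\theta|^2 = \sum_{i,j} g_{ij}\, d\theta_i\, d\theta_j = d\theta^T G\, d\theta$
    Parameter $W$. Euclid: $G = I$
    [Figure: coordinate axes $\theta_i$, $\theta_j$ with a small displacement $d\theta$]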
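  17. Space of Matrices: Lie group

    $dX = dW\, W^{-1}$
    $|dW|^2 = \mathrm{tr}(dX\, dX^T) = \mathrm{tr}(dW\, W^{-1} W^{-T} dW^T)$
    $\tilde\nabla l = \nabla l\, W^T W$
    [Figure: the step from $I$ to $I + dX$ corresponds, through $W^{-1}$, to the step from $W$ to $W + dW$; $dX$ is a non-holonomic basis]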
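  18. Natural Gradient Learning Algorithm

    $\Delta W = \eta\, \{ I - \varphi(y)\, y^T \}\, W$
    $\Delta W_{ij} = \eta \sum_k \{ \delta_{ik} - \varphi_i(y_i)\, y_k \}\, W_{kj}$
    $\varphi_i(y_i) = -\frac{d}{dy_i} \log q_i(y_i)$
    Equivalently $\Delta W = \eta\, F(y)\, W$ with $F(y) = I - \varphi(y)\, y^T$, where $y = Wx$.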
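  19. Mathematical Neurons

    $y = \varphi\left( \sum_i w_i x_i - h \right) = \varphi(w \cdot x - h)$
    [Figure: neuron with input $x$ and output $y$; activation $\varphi(u)$ plotted against $u$]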
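  20. Multilayer Perceptrons

    $y = \sum_i v_i\, \varphi(w_i \cdot x) + n$
    $p(y | x; \theta) = c \exp\left\{ -\frac{1}{2} (y - f(x, \theta))^2 \right\}$, $f(x, \theta) = \sum_i v_i\, \varphi(w_i \cdot x)$
    $x = (x_1, x_2, \ldots, x_n)$, $\theta = (w_1, \ldots, w_m; v_1, \ldots, v_m)$
    [Figure: network with input $x$, hidden weights $w_1$, $w_2$, output weights $v_1$, $v_2$, and output $y$]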
  21. Geometry of singular model

    $y = v\, \varphi(w \cdot x) + n$
    Singular set: $v\, |w| = 0$
    [Figure: parameter space with axes $w$ and $v$]
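  22. Learning, Estimation, and Model Selection

    $E_{\mathrm{gen}} = D[p_0(y | x) : p(y | x; \hat\theta)]$
    $E_{\mathrm{train}} = D[p_{\mathrm{emp}}(y | x) : p(y | x; \hat\theta)]$
    $d$: dimension; $E_{\mathrm{gen}} \approx \frac{d}{2n}$, $E_{\mathrm{gen}} \approx E_{\mathrm{train}} + \frac{d}{n}$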
  23. Flaws of MLP

    Slow convergence: plateau
    Local minima
    Boosting and Bagging
    [Figure axis label: error]
  24. Information Geometry of MLP

    Natural Gradient Learning: S. Amari; H.Y. Park
    $\Delta\theta_t = -\eta_t G_t^{-1} \nabla l$
    $G_{t+1}^{-1} = (1 + \varepsilon_t)\, G_t^{-1} - \varepsilon_t\, G_t^{-1} \nabla f\, (\nabla f)^T G_t^{-1}$
  25. Reinforcement learning: Markov decision process

    State $S$: $\{ s \}$
    Action $A$: $\{ a \}$
    Reward $R$: $r(s, a)$
    Policy $P$: $\{ \pi(a | s; \theta) \}$
    Value $V$: $E\left[ \sum_t r(s_t, a_t) \right]$
  26. Fisher information matrix; policy natural gradient

    $G = E[\nabla \log \pi(a | s; \theta)\, \nabla \log \pi(a | s; \theta)^T]$
    $\theta_{t+1} = \theta_t - \eta_t \tilde\nabla l(x_t, \theta_t)$, $\tilde\nabla l = G^{-1} \nabla l$
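A small sketch of the policy Fisher matrix and one natural gradient step, under simplifying assumptions that are not in the slides: a single state, a softmax policy $\pi(a; \theta)$ over three actions, a fixed hypothetical reward vector `r`, and the loss $l = -E_\pi[r]$. For a softmax policy $G$ is singular, because adding a constant to every $\theta_a$ leaves the policy unchanged, so the sketch swaps $G^{-1}$ for a pseudo-inverse.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())       # shift for numerical stability
    return z / z.sum()

def policy_fisher(theta):
    """G = E_a[grad log pi grad log pi^T] for pi = softmax(theta), single state."""
    pi = softmax(theta)
    scores = np.eye(len(theta)) - pi      # row a is grad log pi(a) = e_a - pi
    return (scores * pi[:, None]).T @ scores

theta = np.array([0.2, -0.1, 0.5])
r = np.array([1.0, 0.0, 2.0])             # hypothetical reward per action
pi = softmax(theta)
G = policy_fisher(theta)
grad_l = -pi * (r - pi @ r)               # gradient of l = -expected reward
nat_grad = np.linalg.pinv(G, rcond=1e-10) @ grad_l   # pseudo-inverse, since G is singular
theta_next = theta - 0.1 * nat_grad       # theta_{t+1} = theta_t - eta * natural gradient
```

In this toy setting the natural gradient step moves each $\theta_a$ in proportion to the centered reward $r_a - \mathrm{mean}(r)$, independent of the current policy probabilities, whereas the ordinary gradient is scaled by $\pi_a$.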