Transport Information Geometric Computations Year 1

Wuchen Li

August 20, 2023

  1. Transport Information Geometric Computations
     Wuchen Li, University of South Carolina, Columbia
     AFOSR Comp Math review meeting, Aug 10, 2023
     Supported by AFOSR YIP award No. FA9550-23-1-0087, starting from 05/01/2023.
  2. Current Collaborators
     UCLA: S. Osher; Osher’s group: S.T. Liu, S. Liu, M. Zhou, X. Zuo, T. Meng, Y. Liu, Y. Hong (visiting student); Q. Gu (Computer Science);
     U of SC: L. Lu, W. Dahmen, H. Wang, Z. Wang;
     UH/Princeton: Z. Han, V. Poor;
     Brown/ND: C. Shu, G. Fu;
     U Mass-Amherst: M. Katsoulakis’ group, B. Zhang;
     Caltech: A. Stuart, F. Hoffmann, Y. Chen, et al.;
     UCI: T. Georgiou;
     UMN: L. Wang, W. Lee;
     Stanford: M. Pilanci, Y. Wang;
     Gatech: P. Chen;
     FSU/U-M: Q. Feng, B. Erhan;
     Duke/Purdue: J. Liu, Y. Gao; ...
  3. Information Geometry
     The information matrix (a.k.a. Fisher information matrix, Fisher–Rao metric) plays important roles in estimation, information science, statistics, and machine learning:
     - Statistics and information science: likelihood principle; Cramér–Rao bound, etc.;
     - Statistical physics: non-equilibrium thermodynamics (Sosuke Ito et al.);
     - Population games via replicator dynamics (Shahshahani, Smith), with applications in biology, economics, cancer, epidemic propagation, etc.;
     - AI inference problems: f-divergence, Amari α-divergence, etc.;
     - Information geometry optimization methods (KFAC, ADAM), with applications in AI: natural gradient (Amari); ADAM (Kingma et al., 2014); stochastic relaxation (Malagò); and many more in the book Information Geometry (Ay et al.).
  4. Learning: Objective function
     Given a data measure $p_{\mathrm{data}}(x) = \frac{1}{N}\sum_{i=1}^{N}\delta_{X_i}(x)$ and a parameterized model $p(x,\theta)$, most problems in AI or sampling refer to
       $\min_{p_\theta \in p(\Theta)} D_{\mathrm{KL}}(p_{\mathrm{data}}, p_\theta).$
     One typical choice is the Kullback–Leibler divergence (relative entropy; Gibbs free energy):
       $D_{\mathrm{KL}}(p_{\mathrm{data}}, p_\theta) = \int_{\Omega} p_{\mathrm{data}}(x)\,\log\frac{p_{\mathrm{data}}(x)}{p(x,\theta)}\,dx.$
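A remark on this objective (a standard identity, not stated on the slide): since $p_{\mathrm{data}}$ is the empirical measure of the samples, the KL objective is, formally up to a $\theta$-independent term, the negative average log-likelihood,
  $D_{\mathrm{KL}}(p_{\mathrm{data}}, p_\theta) = \mathrm{const} - \frac{1}{N}\sum_{i=1}^{N}\log p(X_i,\theta),$
so minimizing it over $\theta$ is maximum likelihood estimation, consistent with the likelihood principle mentioned in item 3.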
  5. Learning: Gradient direction
     The natural gradient method refers to
       $\theta^{k+1} = \theta^{k} - h\, G_F(\theta^{k})^{-1}\, \nabla_\theta D_{\mathrm{KL}}(p_{\mathrm{data}}, p_{\theta^{k}}),$
     where $h > 0$ is a stepsize and $G_F(\theta) \in \mathbb{R}^{|\Theta|\times|\Theta|}$, with
       $G_F(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\big[\nabla_{\theta_i}\log p(X,\theta)\cdot\nabla_{\theta_j}\log p(X,\theta)\big],$
     being the Fisher information matrix.
     Why the natural gradient and Fisher information matrix?
     - Parameterization invariant;
     - Pre-conditioner for KL-divergence-related learning problems;
     - Fisher efficient;
     - Online Cramér–Rao bound.
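A minimal NumPy sketch of this update (not from the talk; the function name and toy data are illustrative) for fitting a one-dimensional Gaussian $p(x;\theta)=\mathcal{N}(\mu,\sigma^2)$, $\theta=(\mu,\sigma)$, whose Fisher information matrix has the closed form listed in the table in item 22:

```python
# Minimal sketch (not from the talk): natural gradient descent for fitting a
# one-dimensional Gaussian p(x; theta) = N(mu, sigma^2), theta = (mu, sigma),
# to samples, using the closed-form Fisher information matrix of this family.
import numpy as np

def natural_gradient_step(theta, data, h=0.1):
    mu, sigma = theta
    # Euclidean gradient of D_KL(p_data, p_theta), i.e. of the negative average log-likelihood.
    dlogp_dmu = (data - mu) / sigma**2
    dlogp_dsigma = -1.0 / sigma + (data - mu) ** 2 / sigma**3
    grad = -np.array([dlogp_dmu.mean(), dlogp_dsigma.mean()])
    # Fisher information matrix G_F(theta) of the Gaussian family in (mu, sigma) coordinates.
    G_F = np.diag([1.0 / sigma**2, 2.0 / sigma**2])
    return theta - h * np.linalg.solve(G_F, grad)

rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=0.7, size=2000)
theta = np.array([0.0, 1.0])
for _ in range(50):
    theta = natural_gradient_step(theta, data)
print(theta)  # roughly (1.5, 0.7), up to sampling error
```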
  6. Optimal transport
     In recent years, optimal transport (a.k.a. Earth mover’s distance, Monge–Kantorovich problem, Wasserstein distance) has witnessed a lot of applications:
     - Theory (Brenier, Gangbo, Villani, Figalli, et al.);
     - Gradient flows (Otto, Villani, Maas, Mielke, et al.);
     - Proximal methods (Jordan, Kinderlehrer, Otto, et al.);
     - Physics: GENERICs/pre-GENERICs (Grmela and Öttinger, Duong, et al.), flux-gradient flows (Li, Liu, Osher);
     - Mean field games (Lasry, Lions, et al.); population games; evolutionary game theory (Degond, Liu, Li, et al.);
     - Image processing (Rubner et al., 2000; Tannenbaum, Georgiou, et al.);
     - Inverse problems (Stuart, Li, Osher, et al.);
     - Scientific computing (Benamou, Brenier, Carrillo, Oberman, Peyré, Osher, et al.);
     - AI and machine learning: see NIPS, ICLR, ICML 2015–2023;
     - Minimal surfaces in Wasserstein spaces (Li, Georgiou); Wasserstein common noises (Li).
  7. Optimal transport density manifold
     Optimal transport has an optimal control reformulation, obtained as the dual of the dual of its linear programming formulation:
       $D_W(p^0, p^1)^2 = \inf_{p_t}\Big\{\int_0^1 g_W(\partial_t p_t, \partial_t p_t)\,dt = \int_0^1\!\!\int_{\Omega} (\nabla\Phi_t, \nabla\Phi_t)\, p_t\, dx\, dt\Big\},$
     where the infimum runs over all vector fields $v_t = \nabla\Phi_t$, under the constraint of the continuity equation
       $\partial_t p_t + \nabla\cdot(p_t\nabla\Phi_t) = 0, \qquad p_0 = p^0, \quad p_1 = p^1.$
     Here $(\mathcal{P}(\Omega), g_W)$ forms an infinite-dimensional Riemannian manifold (John D. Lafferty, The density manifold and configuration space quantization, 1988).
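On the real line, the distance defined by this dynamic formulation equals the quantile-function formula $D_W(p^0,p^1)^2=\int_0^1 |F_0^{-1}(s)-F_1^{-1}(s)|^2\,ds$. A minimal NumPy sketch (not from the talk) evaluating it from samples as a sanity check:

```python
# Minimal sketch (not from the talk): on the real line, the Wasserstein-2 distance
# defined by the dynamic formulation equals the quantile-function formula
#   D_W(p0, p1)^2 = int_0^1 |F0^{-1}(s) - F1^{-1}(s)|^2 ds,
# evaluated here from samples.
import numpy as np

def w2_distance_1d(samples0, samples1, num_quantiles=1000):
    s = (np.arange(num_quantiles) + 0.5) / num_quantiles
    q0 = np.quantile(samples0, s)   # empirical inverse CDF of p0
    q1 = np.quantile(samples1, s)   # empirical inverse CDF of p1
    return np.sqrt(np.mean((q0 - q1) ** 2))

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, size=50_000)
x1 = rng.normal(2.0, 1.0, size=50_000)
# For two Gaussians with equal variance, D_W is the distance between the means.
print(w2_distance_1d(x0, x1))  # close to 2.0
```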
  8. Learning Wasserstein-type gradient directions
     Goal: We work on the optimal-transport-induced geometries (information matrices) and dynamics in probability models. They are used to design mathematically safe algorithms that are fast, efficient/scalable, and accurate.
     Literature review:
     - Wasserstein gradient flow: Otto (2001);
     - Natural gradient works efficiently in learning: Amari (1998);
     - Constrained gradient flows in Wasserstein spaces: Carlen, Gangbo (2003).
  9. TIG computations of complex systems
     Goals:
     - Fast: convexities in nonlinear optimization in probability spaces;
     - Accurate: approximation in nonlinear probability models;
     - Scalable/Efficient: computation of nonlinear PDEs and controls in complex systems.
  10. TIG directions
     - TIG geometric calculus: continuous domain (Lafferty, Lott, Li); discrete domain/complex networks (Li);
     - TIG convexity analysis: generalized Ricci curvatures in interacting stochastic systems: continuous domain (Li, Feng, Erhan); discrete domain/complex networks (Li, Lu);
     - TIG statistical formulations (Li, Amari, Zhao, Matsuda, Fabio, et al.);
     - TIG AI scientific computing (Li, Osher, Liu, Lee, Hong, Wang, et al.);
     - TIG optimization methods (Li, Osher, Stuart, Franca, Lin, Wang, Pilanci, Chen, Gu, et al.);
     - TIG high-order structure-preserving computations (Li, Osher, Fu, et al.);
     - TIG modeling in social systems, pandemics, engineering, and more (Li, Osher, Liu, Lee, Han, Gao, Liu, et al.);
     - TIG inverse problems of complex systems (Li, Osher, S.T. Liu, S. Liu, X. Zhe, M. Zhou, et al.);
     - TIG quantum mechanics (free probability) and computing (Li, Becker, Jekel, Shlyakhtenko).
  11. Part I: Transport information statistical formulations
     Main results:
     - Li, Transport information geometry: Riemannian calculus on probability simplex, 2018;
     - Li, Zhao, Wasserstein information matrix, 2019;
     - Li, Rubio, On a prior based on the Wasserstein information matrix, 2022;
     - Amari, Matsuda, Wasserstein statistics in 1D location-scale model, March 2020;
     - Amari, Matsuda, Information geometry of Wasserstein statistics on shapes and affine deformations, 2023.
     Review paper: G. Khan, J. Zhang, When optimal transport meets information geometry, 2022.
  12. Pull-back = computational schemes
     Probability models:
     - Finite volume; finite element; kernel methods;
     - Gaussian; exponential family; Gaussian mixture;
     - Generative models by deep neural network functions: neural ODE (explicit); generative adversarial networks (implicit).
  13. Information matrix
     Definition (Statistical information matrix). Consider the density manifold $(\mathcal{P}(\mathcal{X}), g)$ with a metric tensor $g$, and a smoothly parameterized statistical model $p_\theta$ with parameter $\theta \in \Theta \subset \mathbb{R}^d$. Then the pull-back $G$ of $g$ onto the parameter space $\Theta$ is given by
       $G(\theta) = \big\langle \nabla_\theta p_\theta,\; g(p_\theta)\,\nabla_\theta p_\theta \big\rangle.$
     Denote $G(\theta) = \big(G(\theta)_{ij}\big)_{1\le i,j\le d}$; then
       $G(\theta)_{ij} = \int_{\mathcal{X}} \frac{\partial}{\partial\theta_i} p(x;\theta)\,\Big(g(p_\theta)\,\frac{\partial}{\partial\theta_j} p\Big)(x;\theta)\,dx.$
     Here we name $g$ the statistical metric and call $G$ the statistical information matrix.
  14. Score
     Definition (Score function). Denote $\Phi_i : \mathcal{X}\times\Theta \to \mathbb{R}$, $i = 1,\ldots,n$, satisfying
       $\Phi_i(x;\theta) = \Big(g(p_\theta)\,\frac{\partial}{\partial\theta_i} p\Big)(x;\theta).$
     They are the score functions associated with the statistical information matrix $G$, viewed as equivalence classes in $C(\mathcal{X})/\mathbb{R}$. The representatives of the equivalence classes are determined by the normalization condition
       $\mathbb{E}_{p_\theta}\,\Phi_i = 0, \qquad i = 1,\ldots,n.$
     Then the statistical information matrix satisfies
       $G(\theta)_{ij} = \int_{\mathcal{X}} \Phi_i(x;\theta)\,\big(g(p_\theta)^{-1}\Phi_j\big)(x;\theta)\,dx.$
  15. Examples: Fisher information matrix
     Consider $g(p)^{-1} = p$. Then
       $G_F(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\big[\Phi_i(X;\theta)\,\Phi_j(X;\theta)\big],$
     where $\Phi_k = \frac{1}{p_\theta}\nabla_{\theta_k} p_\theta$, $k \in \{i,j\}$. Using the fact that $\frac{1}{p}\nabla_{\theta_k} p = \nabla_{\theta_k}\log p$,
       $G_F(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\Big[\frac{\partial}{\partial\theta_i}\log p(X;\theta)\,\frac{\partial}{\partial\theta_j}\log p(X;\theta)\Big].$
     In the literature, $G_F(\theta)$ is known as the Fisher information matrix and $\nabla_\theta\log p(X;\theta)$ is named the score function.
  16. Examples: Wasserstein information matrix
     Consider $g(p)^{-1} = -\nabla\cdot(p\nabla)$. Then
       $G_W(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\big[\nabla_x\Phi^W_i(X;\theta)\cdot\nabla_x\Phi^W_j(X;\theta)\big],$
     where $-\nabla_x\cdot\big(p_\theta\,\nabla_x\Phi^W_k\big) = \nabla_{\theta_k} p_\theta$, $k \in \{i,j\}$.
     Here we call $G_W(\theta)$ the Wasserstein information matrix and name $\Phi^W$ the Wasserstein score function.
  17. Distance/Divergence and information matrix
     Specifically, given a smooth family of probability densities $p(x;\theta)$ and a perturbation $\Delta\theta \in T_\theta\Theta$, consider the following Taylor expansions in terms of $\Delta\theta$:
       $D_{\mathrm{KL}}\big(p(\theta)\,\|\,p(\theta+\Delta\theta)\big) = \tfrac{1}{2}\,\Delta\theta^{T} G_F(\theta)\,\Delta\theta + o(|\Delta\theta|^2),$
     and
       $D_W\big(p(\theta+\Delta\theta),\, p(\theta)\big)^2 = \Delta\theta^{T} G_W(\theta)\,\Delta\theta + o(|\Delta\theta|^2).$
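These two expansions can be checked numerically in the Gaussian family, where the KL divergence and the Wasserstein-2 distance have closed forms and the table in item 22 gives $G_F(\mu,\sigma)=\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$ and $G_W(\mu,\sigma)=I$. A minimal check (not from the talk):

```python
# Minimal check (not from the talk) of the two expansions for the 1D Gaussian
# family theta = (mu, sigma), using the closed-form KL divergence and W2 distance
# between Gaussians and the matrices G_F, G_W listed in item 22.
import numpy as np

def kl_gaussian(mu0, s0, mu1, s1):
    return np.log(s1 / s0) + (s0**2 + (mu0 - mu1) ** 2) / (2 * s1**2) - 0.5

def w2_gaussian(mu0, s0, mu1, s1):
    return np.sqrt((mu0 - mu1) ** 2 + (s0 - s1) ** 2)

mu, sigma = 1.0, 2.0
d_theta = np.array([1e-3, -2e-3])                  # a small perturbation of theta
G_F = np.diag([1.0 / sigma**2, 2.0 / sigma**2])
G_W = np.eye(2)

kl = kl_gaussian(mu, sigma, mu + d_theta[0], sigma + d_theta[1])
w2_sq = w2_gaussian(mu, sigma, mu + d_theta[0], sigma + d_theta[1]) ** 2
print(kl, 0.5 * d_theta @ G_F @ d_theta)   # both approximately 1.1e-6
print(w2_sq, d_theta @ G_W @ d_theta)      # both approximately 5.0e-6
```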
  18. PDEs connecting Wasserstein and Fisher scores
     The Wasserstein score functions $\Phi^W_i(x;\theta)$ satisfy the following Poisson equation:
       $\nabla_x\log p(x;\theta)\cdot\nabla_x\Phi^W_i(x;\theta) + \Delta_x\Phi^W_i(x;\theta) = -\frac{\partial}{\partial\theta_i}\log p(x;\theta).$
  19. One dimensional sample space
     If $\mathcal{X} \subset \mathbb{R}^1$, the Wasserstein score functions satisfy
       $\Phi^W_i(x;\theta) = -\int^x \frac{1}{p(z;\theta)}\,\frac{\partial}{\partial\theta_i} F(z;\theta)\,dz,$
     where $F(x;\theta) = \int_{-\infty}^x p(y;\theta)\,dy$ is the cumulative distribution function. The Wasserstein information matrix (Chen, Li, Wasserstein natural gradient in continuous sample space, 2018) satisfies
       $G_W(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\left[\frac{\partial_{\theta_i} F(X;\theta)\;\partial_{\theta_j} F(X;\theta)}{p(X;\theta)^2}\right].$
     The Wasserstein score function is the average of the cumulative Fisher score function, and the Wasserstein information matrix is the covariance of the density average of the cumulative Fisher score function.
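For instance, the one-dimensional formula can be evaluated numerically for the Gaussian family and compared with the table in item 22, which lists $G_W(\mu,\sigma)=I$. A minimal sketch (not from the talk; it assumes SciPy is available):

```python
# Minimal sketch (not from the talk): evaluate the one-dimensional formula for G_W
# numerically for the Gaussian family theta = (mu, sigma); the table in item 22
# gives G_W(mu, sigma) = I. Uses SciPy, central finite differences for dF/dtheta,
# and a Riemann sum on a truncated grid.
import numpy as np
from scipy.stats import norm

def wasserstein_information_matrix(mu, sigma, eps=1e-5):
    x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 20001)
    dx = x[1] - x[0]
    p = norm.pdf(x, loc=mu, scale=sigma)
    dF_dmu = (norm.cdf(x, mu + eps, sigma) - norm.cdf(x, mu - eps, sigma)) / (2 * eps)
    dF_dsig = (norm.cdf(x, mu, sigma + eps) - norm.cdf(x, mu, sigma - eps)) / (2 * eps)
    dF = np.stack([dF_dmu, dF_dsig])                       # shape (2, len(x))
    # G_W(theta)_ij = E_p[ dF_i dF_j / p^2 ] = int dF_i(x) dF_j(x) / p(x) dx
    return (dF[:, None, :] * dF[None, :, :] / p).sum(axis=-1) * dx

print(wasserstein_information_matrix(mu=1.0, sigma=2.0))   # close to the 2x2 identity
```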
  20. Separability
     If $p(x;\theta)$ is an independence model, i.e.
       $p(x,\theta) = \prod_{k=1}^n p_k(x_k;\theta), \qquad x = (x_1,\cdots,x_n),$
     then there exists a set of one-dimensional functions $\Phi^{W,k} : \mathcal{X}_k\times\Theta_k \to \mathbb{R}$ such that
       $\Phi^W(x;\theta) = \sum_{k=1}^n \Phi^{W,k}(x_k;\theta).$
     In addition, the Wasserstein information matrix is separable:
       $G_W(\theta) = \sum_{k=1}^n G^k_W(\theta), \qquad \text{where } G^k_W(\theta) = \mathbb{E}_{X\sim p_\theta}\big[\nabla_{x_k}\Phi^{W,k}(X;\theta)\cdot\nabla_{x_k}\Phi^{W,k}(X;\theta)\big].$
  21. Analytic examples
     Consider a one-dimensional generative model $p(\cdot;\theta) = f_{\theta\#}p_0(\cdot)$, with $p_0$ a given distribution. Then
       $G_W(\theta) = \int |\nabla_\theta f_\theta(x)|^2\, p_0(x)\,dx.$
     For a single ReLU function
       $f_\theta(x) = \sigma(x-\theta) = \begin{cases} 0, & x \le \theta,\\ x-\theta, & x > \theta,\end{cases}$
     we have $G_W(\theta) = F_0(\theta)$, with $F_0$ the cumulative distribution function of $p_0$.
     Figure: two examples of the push-forward family described above, with parameters $\theta_1 = 3$ and $\theta_2 = 5$.
  22. Statistical information matrix
     - Uniform, $p(x;a,b) = \frac{1}{b-a}\mathbf{1}_{(a,b)}(x)$:  $G_W(a,b) = \frac{1}{3}\begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}$;  $G_F(a,b)$ not well-defined.
     - Gaussian, $p(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$:  $G_W(\mu,\sigma) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$;  $G_F(\mu,\sigma) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{2}{\sigma^2} \end{pmatrix}$.
     - Exponential, $p(x;m,\lambda) = \lambda e^{-\lambda(x-m)}$:  $G_W(m,\lambda) = \begin{pmatrix} 1 & 0 \\ 0 & \frac{2}{\lambda^4} \end{pmatrix}$;  $G_F(m,\lambda)$ not well-defined.
     - Laplacian, $p(x;m,\lambda) = \frac{\lambda}{2} e^{-\lambda|x-m|}$:  $G_W(m,\lambda) = \begin{pmatrix} 1 & 0 \\ 0 & \frac{2}{\lambda^4} \end{pmatrix}$;  $G_F(m,\lambda) = \begin{pmatrix} \lambda^2 & 0 \\ 0 & \frac{1}{\lambda^2} \end{pmatrix}$.
     - Location-scale, $p(x;m,\lambda) = \frac{1}{\lambda}\, p\!\big(\frac{x-m}{\lambda}\big)$:  $G_W(\lambda,m) = \begin{pmatrix} \frac{\mathbb{E}_{\lambda,m}x^2 - 2m\,\mathbb{E}_{\lambda,m}x + m^2}{\lambda^2} & 0 \\ 0 & 1 \end{pmatrix}$;  $G_F(\lambda,m) = \begin{pmatrix} \frac{1}{\lambda^2}\Big(1 + \int_{\mathbb{R}}\big(\frac{(x-m)^2 p'^2}{\lambda^2 p} + (x-m)p'\big)\,dx\Big) & \int_{\mathbb{R}} \frac{(x-m)p'^2}{\lambda^3 p}\,dx \\ \int_{\mathbb{R}} \frac{(x-m)p'^2}{\lambda^3 p}\,dx & \frac{1}{\lambda^2}\int_{\mathbb{R}} \frac{p'^2}{p}\,dx \end{pmatrix}$.
     - Independent, $p(x,y;\theta) = p(x;\theta)\,p(y;\theta)$:  $G_W(x,y;\theta) = G^1_W(x;\theta) + G^2_W(y;\theta)$;  $G_F(x,y;\theta) = G^1_F(x;\theta) + G^2_F(y;\theta)$.
     - ReLU push-forward, $p(x;\theta) = f_{\theta\#}p(x)$, $f_\theta$ a $\theta$-parameterized ReLU:  $G_W(\theta) = F(\theta)$, $F$ the cdf of $p(x)$;  $G_F(\theta)$ not well-defined.
     Table: Wasserstein and Fisher information matrices for various probability families.
  23. Part II: Transport information proximal methods in generative AI
     - Lin, Li, Osher, et al., Wasserstein proximal of GANs, GSI, 2018;
     - Li, Lin, et al., Affine proximal learning, GSI, 2019;
     - Arbel, Gretton, Li, et al., Kernelized Wasserstein natural gradient, ICLR (spotlight), 2020;
     - Liu, Li, et al., Neural parametric Fokker-Planck equations, SIAM Journal on Numerical Analysis, 2020.
  24. Wasserstein natural gradient
     Given a loss function $F : \mathcal{P}(\Omega) \to \mathbb{R}$ and a probability model $p(\cdot,\theta)$, the associated gradient flow on a Riemannian manifold is defined by
       $\frac{d\theta}{dt} = -\nabla_g F(p(\cdot,\theta)).$
     Here $\nabla_g$ is the Riemannian gradient operator satisfying
       $g_\theta\big(\nabla_g F(p(\cdot,\theta)),\,\xi\big) = \nabla_\theta F(p(\cdot,\theta))\cdot\xi$
     for any tangent vector $\xi \in T_\theta\Theta$, where $\nabla_\theta$ denotes the differential operator (Euclidean gradient).
  25. Proximal method on a probability space
     Denote $G(p)^{-1} = -\nabla\cdot(p\nabla)$. Consider the gradient flow
       $\dot p = -G(p)^{-1}\,\delta_p F(p) = \nabla\cdot\big(p\,\nabla\delta_p F(p)\big).$
     The proximal method refers to
       $p^{k+1} = \arg\min_{p\in\mathcal{P}}\; F(p) + \frac{1}{2h}\, d_G(p, p^k)^2,$
     where $h > 0$ is the step size and $d_G(x,y)$ is the distance function induced by $G$, known as the transport distance. This is known as the variational backward Euler method:
       $p^{k+1} = p^k + h\,\nabla\cdot\big(p^{k+1}\,\nabla\delta_p F(p^{k+1})\big) + o(h).$
     A linearization of the proximal method has been formulated in (Li, Lu, Wang, 2019).
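To make the backward Euler step concrete, here is a minimal particle-based sketch (not from the talk) for the special case of a potential energy $F(p) = \int V(x)\,p(x)\,dx$: with a fixed particle coupling, the proximal step decouples into the pointwise proximal operator of $V$, solved below by a few inner gradient iterations. The quadratic $V$ is only an illustration.

```python
# Minimal particle-based sketch (not from the talk) of one proximal / backward Euler
# step for the special case F(p) = int V(x) p(x) dx. With a fixed particle coupling,
# the step decouples into x_i <- argmin_x V(x) + |x - x_i^k|^2 / (2h), solved below
# by a few inner gradient iterations. The quadratic V is only an illustration.
import numpy as np

def V_grad(x):
    return x          # gradient of V(x) = |x|^2 / 2

def proximal_step(particles, h, inner_steps=200, lr=1e-2):
    x = particles.copy()
    for _ in range(inner_steps):
        # gradient of the per-particle proximal objective V(x) + |x - x^k|^2 / (2h)
        x -= lr * (V_grad(x) + (x - particles) / h)
    return x

rng = np.random.default_rng(0)
particles = rng.normal(size=(500, 2)) + 3.0       # initial empirical measure p^0
for _ in range(20):
    particles = proximal_step(particles, h=0.5)
print(np.linalg.norm(particles.mean(axis=0)))     # drifts toward the minimizer of V at 0
```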
  26. Generative adversarial networks
     For each parameter $\theta \in \mathbb{R}^d$ and a given neural-network-parameterized mapping function $g_\theta$, consider $p_\theta = g_{\theta\#}p(z)$.
  27. Natural proximal on probability models
     Consider the gradient flow
       $\dot\theta = -G(\theta)^{-1}\,\nabla_\theta F(p(\theta,\cdot)),$
     where $G(\theta) = \big(\nabla_\theta p,\; (-\nabla\cdot(p\nabla))^{-1}\nabla_\theta p\big)_{L^2}$. The update scheme follows
       $\theta^{k+1} = \arg\min_{\theta\in\Theta}\; F(p_\theta) + \frac{1}{2h}\, D_W(\theta,\theta^k)^2,$
     where $\theta$ denotes the parameters of the generator, $F(p_\theta)$ is the loss function, and $D_W$ is the Wasserstein metric. In practice [Lin, Li, Osher, et al., 2018], consider
       $\theta^{k+1} = \arg\min_{\theta\in\Theta}\; F(p_\theta) + \frac{1}{B}\sum_{i=1}^{B} \frac{1}{2h}\,\big\| g_\theta(z_i) - g_{\theta^k}(z_i) \big\|^2,$
     where $g_\theta$ is the generator, $B$ is the batch size, and $z_i \sim p(z)$ are inputs to the generator.
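A minimal PyTorch sketch of the practical update above (not the implementation from Lin, Li, Osher, et al.; the networks, loss, and hyperparameters are placeholder assumptions):

```python
# Minimal PyTorch sketch (not the implementation from Lin, Li, Osher, et al.) of the
# practical update above: the usual generator loss plus the relaxed Wasserstein
# proximal term (1/B) sum_i ||g_theta(z_i) - g_{theta^k}(z_i)||^2 / (2h).
# The networks, loss, and hyperparameters are placeholders.
import copy
import torch

def generator_proximal_update(generator, discriminator, opt, batch_size=64,
                              z_dim=16, h=0.1, inner_steps=5):
    frozen = copy.deepcopy(generator)              # g_{theta^k}, held fixed
    for p in frozen.parameters():
        p.requires_grad_(False)
    z = torch.randn(batch_size, z_dim)             # fixed batch z_i ~ p(z)
    for _ in range(inner_steps):                   # approximately solve the arg-min
        fake = generator(z)
        loss = -discriminator(fake).mean()         # placeholder generator loss F(p_theta)
        prox = ((fake - frozen(z)) ** 2).sum(dim=1).mean() / (2 * h)
        opt.zero_grad()
        (loss + prox).backward()
        opt.step()

# usage with toy fully connected networks
g = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
d = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
opt = torch.optim.SGD(g.parameters(), lr=1e-3)
generator_proximal_update(g, d, opt)
```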
  28. Examples: Jensen–Shannon entropy loss
     Figure: The relaxed Wasserstein proximal of GANs on the CIFAR-10 (left) and CelebA (right) datasets.
  29. Part III: Transport information scientific computing and numerical analysis
     - Lee, Wang, Li, Deep JKO: time-implicit particle methods for general nonlinear gradient flows, ongoing;
     - Zhao, Li, Scaling limit of Wasserstein Gaussian mixture models, preprint in preparation;
     - G. Fu, S. Osher, W.C. Li, High order spatial discretization for variational time implicit schemes: Wasserstein gradient flows and reaction-diffusion systems, accepted in Journal of Computational Physics, July 2023.
  30. Part IV: Transport information optimization in MCMC and Bayesian inverse problems
     - A. Garbuno-Inigo, F. Hoffmann, W.C. Li, A.M. Stuart, Interacting Langevin diffusions: gradient structure and ensemble Kalman sampler, SIAM Journal on Applied Dynamical Systems, 2019;
     - Y.F. Wang, W.C. Li, Information Newton flow: second-order optimization method in probability space, Jan 2020;
     - Y.F. Wang, W.C. Li, Accelerated information gradient flow, accepted in Journal of Scientific Computing, Nov 2021;
     - Y.F. Wang, P. Chen, W.C. Li, Projected Wasserstein gradient descent for high-dimensional Bayesian inference, SIAM/ASA Journal on Uncertainty Quantification, 2022;
     - Y.F. Wang, P. Chen, M. Pilanci, W.C. Li, Optimal neural network approximation of Wasserstein gradient direction via convex optimization, May 2022.
  31. Bayesian logistic regression
     Figure: Comparison of different methods on Bayesian logistic regression, averaged over 10 independent trials. The shaded areas show the variance over the 10 trials. Top: BM; bottom: MED. Left: test accuracy; right: test log-likelihood.
  32. Future works
     Complex systems; Bayesian sampling; time discretization; spatial discretization; entropy; entropy flux; metric; optimization.