Transport Optimization Methods in Bayesian Sampling Problems

We present a class of information-metric optimization methods for Bayesian inverse problems. Two examples of information metrics are studied: the Fisher-Rao metric and the Wasserstein-2 metric. Focusing on the Wasserstein-2 metric, we introduce accelerated gradient and Newton flows, and we formulate practical discrete-time particle algorithms for them. To handle the curse of dimensionality, we further introduce a projected Wasserstein gradient method. Numerical experiments, including PDE-constrained Bayesian inference and parameter estimation in COVID-19 modeling, demonstrate the effectiveness of the proposed methods. This is based on joint works with Yifei Wang and Peng Chen.

Wuchen Li

September 30, 2022

Transcript

  1. Transport Optimization Methods in Bayesian Sampling Problems. Wuchen Li, University of South Carolina. SIAM MDS 22, San Diego, Sep 30. Joint work with Peng Chen (Georgia Tech) and Yifei Wang (Stanford). Partially supported by AFOSR MURI FA9550-18-1-0502 and NSF RTG: 2038080.

  2. Bayesian sampling problems
     Main problem: Given a function $f(x)$, $x \in \Omega$, can we generate samples from $\pi(x) \propto e^{-f(x)}$, where $\pi$ is a probability density function.
     Difficulties:
     - Fast, efficient, accurate algorithms.
     - High-dimensional sampling: $\Omega \subset \mathbb{R}^N$, $N \gg 3$.

  3. (Figure slide; no transcript text.)

  4. A sampling algorithm
     Consider Langevin dynamics: $dX_t = -\nabla f(X_t)\,dt + \sqrt{2}\,dB_t$, where $X_t \in \mathbb{R}^d$ and $B_t$ is the standard Brownian motion in $\mathbb{R}^d$. Denote $X_t \sim \rho_t$. The evolution of $\rho_t$ satisfies
       $\partial_t \rho_t = \nabla \cdot (\rho_t \nabla f) + \Delta \rho_t$.
     Its invariant distribution satisfies
       $\rho_\infty(x) = \pi(x) = \frac{1}{K} e^{-f(x)}$, with $K = \int e^{-f(y)}\,dy$.
     The Euler-Maruyama discretization of Langevin dynamics relates to the Langevin Monte Carlo (LMC) method.

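To make the last point concrete, here is a minimal sketch of Langevin Monte Carlo, i.e. the Euler-Maruyama discretization of the Langevin dynamics above. The quadratic target $f(x) = \|x\|^2/2$, the step size, the iteration count, and the use of independent chains are illustrative choices, not values from the talk.

```python
import numpy as np

def lmc(grad_f, x0, step=1e-2, n_iters=5_000, rng=None):
    """Langevin Monte Carlo: Euler-Maruyama discretization of
    dX_t = -grad f(X_t) dt + sqrt(2) dB_t."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        x = x - step * grad_f(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    return x

# Illustrative target pi(x) ~ exp(-f(x)) with f(x) = |x|^2 / 2, so grad f(x) = x.
samples = np.stack([lmc(lambda x: x, x0=np.zeros(2)) for _ in range(200)])
```

For long times each chain is approximately distributed as $\pi$; running several independent chains, as above, is one simple way to collect samples.
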
  5. Optimization in probability space
     Nowadays, the sampling algorithm relates to the variational problem in the probability density space
       $\mathcal{P}(\Omega) = \{ \rho \in \mathcal{F}(\Omega) : \int_\Omega \rho\,dx = 1,\ \rho \ge 0 \}$.
     Consider the optimization problem
       $\min_{\rho \in \mathcal{P}(\Omega)} E(\rho)$,
     where $E(\rho)$ measures the closeness between $\rho$ and the target distribution $\pi$.

  6. Gradient descent = LMC method
     Consider $E(\rho) = \int \rho(x) \log \frac{\rho(x)}{\pi(x)}\,dx$, whose first variation is $\frac{\delta E}{\delta \rho}(\rho) = \log \frac{\rho}{\pi} + 1$. Observe that
       $\partial_t \rho_t = \nabla \cdot \big( \rho_t \nabla \tfrac{\delta E}{\delta \rho}(\rho_t) \big) = \nabla \cdot \big( \rho_t \nabla (\log \rho_t - \log \pi + 1) \big) = \Delta \rho_t - \nabla \cdot (\rho_t \nabla \log \pi) = \Delta \rho_t + \nabla \cdot (\rho_t \nabla f)$,
     where we apply the fact that $\rho \nabla \log \rho = \nabla \rho$. In short, the LMC method can be viewed as a gradient descent of the KL divergence in probability density space with respect to the optimal transport metric.

  7. First-order optimization algorithms
     Sampling methods turn out to be first-order gradient descent methods in probability space. This viewpoint has many generalizations:
     - Fisher-Rao metric and natural gradient methods (Amari, Ay, Zhang, et al.);
     - Wasserstein metric and Wasserstein gradient descent (Jordan, Kinderlehrer, Otto, Villani, Carrillo, Slepcev, Liu, Lu, Gao, Ambrosio, Gigli, Savare, Gangbo, et al.);
     - Stein metric and Stein variational gradient descent (Liu, Wang);
     - Wasserstein-Kalman metric and ensemble Kalman sampling (Garbuno-Inigo, Hoffmann, Li, Stuart);
     - Wasserstein Newton's metric and Wasserstein Newton's flow (Wang, Li);
     - Projected Wasserstein metric and projected Wasserstein gradient flow (Wang, Chen, Pilanci, Li);
     - Entropy-entropy flux-metric and flux-gradient flows (Li, Liu, Osher, Shu).

  8. Accelerated sampling dynamics
     Goal: design and compute sampling dynamics in information metric spaces.
     Questions:
     - What is the "mean-field" accelerated Langevin dynamics?
     - What is the "mean-field" Newton Langevin dynamics?

  9. Review: Gradient flows
     - The methods for optimization problems in the Euclidean space have been well-studied: $\min_{x \in \mathbb{R}^n} f(x)$.
     - The classical gradient descent method follows $x_{k+1} = x_k - s \nabla f(x_k)$. It can be formulated into a continuous-time gradient flow, $\dot{x} = -\nabla f(x)$.

  10. Review: Accelerated gradient flows
     - Nesterov's accelerated method is widely applied to accelerate vanilla gradient descent:
       $\ddot{x} + \alpha_t \dot{x} + \nabla f(x) = 0$.
     - This is equivalent to a damped Hamiltonian system
       $\begin{pmatrix} \dot{x} \\ \dot{p} \end{pmatrix} + \begin{pmatrix} 0 \\ \alpha_t p \end{pmatrix} - \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix} \begin{pmatrix} \nabla_x H^E(x, p) \\ \nabla_p H^E(x, p) \end{pmatrix} = 0$,
       with the Hamiltonian $H^E(x, p) = \frac{1}{2} \|p\|_2^2 + f(x)$ and initial values $x(0) = x_0$, $p(0) = 0$.

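A small sketch of how this damped Hamiltonian system can be integrated numerically. The semi-implicit Euler scheme, the damping $\alpha_t = 3/t$, and the ill-conditioned quadratic objective are assumptions made for illustration, not part of the slides.

```python
import numpy as np

def accelerated_flow(grad_f, x0, step=1e-2, n_iters=2_000, alpha=lambda t: 3.0 / t):
    """Semi-implicit Euler discretization of the damped Hamiltonian system
    x' = p,  p' = -alpha(t) p - grad f(x),
    i.e. of the Nesterov-type dynamics x'' + alpha_t x' + grad f(x) = 0."""
    x = np.array(x0, dtype=float)
    p = np.zeros_like(x)  # p(0) = 0, matching the slide
    for k in range(1, n_iters + 1):
        t = k * step
        p = p - step * (alpha(t) * p + grad_f(x))  # damped momentum update
        x = x + step * p                           # position update
    return x

# Illustrative ill-conditioned quadratic f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 100.0])
x_min = accelerated_flow(lambda x: A @ x, x0=np.array([5.0, 5.0]))
```
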
  11. Gradient flows
     - The gradient flow for $E(\rho)$ in $(\mathcal{P}(\Omega), g_\rho)$ takes the form
       $\partial_t \rho_t = -G(\rho_t)^{-1} \frac{\delta E(\rho_t)}{\delta \rho_t}$.
     - The Fisher-Rao gradient flow is given by
       $\partial_t \rho_t = -\rho_t \Big( \frac{\delta E}{\delta \rho_t} - \int \frac{\delta E}{\delta \rho_t} \rho_t\,dx \Big)$.
     - The Wasserstein gradient flow writes
       $\partial_t \rho_t = -\big( -\nabla \cdot (\rho_t \nabla) \big) \frac{\delta E}{\delta \rho_t} = \nabla \cdot \big( \rho_t \nabla \frac{\delta E}{\delta \rho_t} \big)$.

  12. Hamiltonian flows in probability space
     The Euler-Lagrange equation in the probability space can be formulated as a system for $(\rho_t, \Phi_t)$:
       $\partial_t \rho_t - G(\rho_t)^{-1} \Phi_t = 0$,
       $\partial_t \Phi_t + \frac{1}{2} \frac{\delta}{\delta \rho_t} \Big( \int \Phi_t G(\rho_t)^{-1} \Phi_t\,dx \Big) + \frac{\delta E}{\delta \rho_t} = 0$.
     These are examples of dynamics in mean-field games, optimal transport, and information geometry.

  13. Example: Fisher-Rao Hamiltonian flow
     - The Fisher-Rao Hamiltonian flow follows
       $\partial_t \rho_t - \Phi_t \rho_t + \Big( \int \Phi_t \rho_t\,dx \Big) \rho_t = 0$,
       $\partial_t \Phi_t + \frac{1}{2} \Phi_t^2 - \Big( \int \rho_t \Phi_t\,dx \Big) \Phi_t + \frac{\delta E}{\delta \rho_t} = 0$.
     - The corresponding Hamiltonian is
       $H^F(\rho_t, \Phi_t) = \frac{1}{2} \Big( \int \Phi_t^2 \rho_t\,dx - \Big( \int \rho_t \Phi_t\,dx \Big)^2 \Big) + E(\rho_t)$.

  14. Example: Wasserstein Hamiltonian flow
     - The Wasserstein Hamiltonian flow writes
       $\partial_t \rho_t + \nabla \cdot (\rho_t \nabla \Phi_t) = 0$,
       $\partial_t \Phi_t + \frac{1}{2} \|\nabla \Phi_t\|^2 + \frac{\delta E}{\delta \rho_t} = 0$.
     - The corresponding Hamiltonian forms
       $H^W(\rho_t, \Phi_t) = \frac{1}{2} \int \|\nabla \Phi_t\|^2 \rho_t\,dx + E(\rho_t)$.

  15. Damped Hamiltonian flows in probability space
     Consider
       $\partial_t \rho_t - G(\rho_t)^{-1} \Phi_t = 0$,
       $\partial_t \Phi_t + \alpha_t \Phi_t + \frac{1}{2} \frac{\delta}{\delta \rho_t} \Big( \int \Phi_t G(\rho_t)^{-1} \Phi_t\,dx \Big) + \frac{\delta E}{\delta \rho_t} = 0$,   (AIG)
     with initial values $\rho_0 = \rho^0$ and $\Phi_0 = 0$.
     AIG: Accelerated Information Gradient flows.

  16. Fisher-Rao AIG flow
     The Fisher-Rao AIG flow writes
       $\partial_t \rho_t - \Phi_t \rho_t + \Big( \int \Phi_t \rho_t\,dx \Big) \rho_t = 0$,
       $\partial_t \Phi_t + \alpha_t \Phi_t + \frac{1}{2} \Phi_t^2 - \Big( \int \rho_t \Phi_t\,dx \Big) \Phi_t + \frac{\delta E}{\delta \rho_t} = 0$.   (F-AIG)

  17. Wasserstein AIG flow
     The Wasserstein AIG flow (Carrillo 2020, Amir et al. 2019) satisfies
       $\partial_t \rho_t + \nabla \cdot (\rho_t \nabla \Phi_t) = 0$,
       $\partial_t \Phi_t + \alpha_t \Phi_t + \frac{1}{2} \|\nabla \Phi_t\|^2 + \frac{\delta E}{\delta \rho_t} = 0$.   (W-AIG)

  18. Covariance AIG flows
     We consider AIG dynamics in Gaussian distributions.
     Theorem. Suppose that $\rho^0, \rho^* \in \mathcal{N}_n^0$ (zero-mean Gaussians) with covariance matrices $\Sigma^0$ and $\Sigma^*$, and let $E(\Sigma)$ evaluate the KL divergence from $\rho$ to $\rho^*$. Then $(\Sigma_t, S_t)$ satisfies
       $\dot{\Sigma}_t - (S_t \Sigma_t + \Sigma_t S_t) = 0$,
       $\dot{S}_t + \alpha_t S_t + S_t^2 + 2 \nabla_{\Sigma_t} E(\Sigma_t) = 0$,   (W-AIG-G)
     with initial values $\Sigma_0 = \Sigma^0$ and $S_0 = 0$. In particular, for any $t \ge 0$, $\Sigma_t$ is well-defined and stays positive definite.

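A forward-Euler sketch of integrating (W-AIG-G) for zero-mean Gaussians. It uses the standard gradient of the Gaussian KL energy, $\nabla_\Sigma \mathrm{KL}(N(0,\Sigma) \| N(0,\Sigma^*)) = \frac{1}{2}(\Sigma^{*-1} - \Sigma^{-1})$; the step size, the damping $\alpha_t = 3/t$, the re-symmetrization step, and the example covariances are our choices, not values from the talk.

```python
import numpy as np

def w_aig_gaussian(Sigma0, Sigma_star, step=1e-3, n_iters=20_000, alpha=lambda t: 3.0 / t):
    """Forward-Euler integration of (W-AIG-G):
      dSigma/dt = S Sigma + Sigma S,
      dS/dt     = -alpha_t S - S^2 - 2 grad_Sigma E(Sigma),
    where E(Sigma) = KL(N(0, Sigma) || N(0, Sigma_star)) has the Euclidean gradient
    grad_Sigma E(Sigma) = 0.5 * (inv(Sigma_star) - inv(Sigma))."""
    Sigma, S = Sigma0.copy(), np.zeros_like(Sigma0)
    Sigma_star_inv = np.linalg.inv(Sigma_star)
    for k in range(1, n_iters + 1):
        t = k * step
        grad_E = 0.5 * (Sigma_star_inv - np.linalg.inv(Sigma))
        Sigma, S = (Sigma + step * (S @ Sigma + Sigma @ S),
                    S - step * (alpha(t) * S + S @ S + 2.0 * grad_E))
        Sigma = 0.5 * (Sigma + Sigma.T)  # re-symmetrize against round-off
    return Sigma

Sigma_T = w_aig_gaussian(np.eye(3), np.diag([1.0, 0.1, 10.0]))
```
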
  19. Convergence rate of W-AIG flows
     Theorem. Suppose that $E(\rho)$ satisfies $\mathrm{Hess}(\beta)$ for $\beta > 0$. The solution $\rho_t$ to (W-AIG) with $\alpha_t = 2\sqrt{\beta}$ satisfies
       $E(\rho_t) \le C_0 e^{-\sqrt{\beta}\,t} = O(e^{-\sqrt{\beta}\,t})$.
     If $E(\rho)$ only satisfies $\mathrm{Hess}(0)$, then the solution $\rho_t$ to (W-AIG) with $\alpha_t = 3/t$ satisfies
       $E(\rho_t) \le C_0' t^{-2} = O(t^{-2})$.
     Here the constants $C_0, C_0'$ only depend on $\rho^0$.

  20. Time scaling and convergence results
     Convergence rate
     - A sufficient condition for $\mathrm{Hess}(\beta)$ is that $f(x)$ is $\beta$-strongly convex.
     - The W-AIG flow improves the $O(e^{-2\beta t})$ convergence rate of the gradient flow to $O(e^{-\sqrt{\beta}\,t})$, especially when $\beta$ is close to $0$.
     Convexity estimations
     - $\beta$ from the Wasserstein Hessian operator (Otto, Villani, et al.; Lott, Villani, Sturm, et al.) = weak form of Bakry-Emery Gamma calculus.
     - Modified Fisher-Rao in optimal transport space = weak form of Gamma calculus (Li, 2018).
     - Modified Fisher-Rao in generalized optimal transport space = information Gamma calculus (Feng, Li, 2019-2022).
     Generalized tensors from (Arnold, Carlen, Ju, Erb, Baudoin, Garofalo).

  21. W-AIG flows at the particle level
     Proposition. Suppose that $X_t \sim \rho_t$ and $V_t = \nabla \Phi_t(X_t)$ are the position and the velocity of a particle at time $t$. Then the differential equation of the particle system corresponding to (W-AIG) writes
       $dX_t = V_t\,dt$,
       $dV_t = -\alpha_t V_t\,dt - \nabla f(X_t)\,dt - \nabla \log \rho_t(X_t)\,dt$.

  22. Discrete-time algorithm for the W-AIG flow
     For $i = 1, 2, \ldots, N$, the update rule follows
       $V_{k+1}^i = \alpha_k V_k^i - \sqrt{\tau}\,\big( \nabla f(X_k^i) + \xi_k(X_k^i) \big)$,
       $X_{k+1}^i = X_k^i + \sqrt{\tau}\,V_{k+1}^i$.
     If $E(\rho)$ is $\mathrm{Hess}(\beta)$, then $\alpha_k = \frac{1 - \sqrt{\beta\tau}}{1 + \sqrt{\beta\tau}}$; if $E(\rho)$ is $\mathrm{Hess}(0)$ or $\beta$ is unknown, then $\alpha_k = \frac{k-1}{k+2}$. Here $\xi_k(x)$ is an approximation of $\nabla \log \rho_t(x)$.
     Difficulties
     - The approximation of $\nabla \log \rho_t(x)$.
     - The stiffness of AIG flows.

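A sketch of one update of the rule above, for particles stored as an array X of shape (N, d). The function name w_aig_step is ours, and grad_f and the score approximation xi are assumed to be vectorized over particles; nothing beyond the update rule and the two choices of alpha_k is taken from the slide.

```python
import numpy as np

def w_aig_step(X, V, grad_f, xi, tau, k, beta=None):
    """One discrete-time W-AIG update for particles X, velocities V (both (N, d)):
      V_{k+1} = alpha_k V_k - sqrt(tau) * (grad f(X_k) + xi_k(X_k)),
      X_{k+1} = X_k + sqrt(tau) * V_{k+1},
    with alpha_k = (1 - sqrt(beta tau)) / (1 + sqrt(beta tau)) under Hess(beta),
    and alpha_k = (k - 1) / (k + 2) when beta is unknown or zero."""
    if beta is not None and beta > 0:
        alpha_k = (1.0 - np.sqrt(beta * tau)) / (1.0 + np.sqrt(beta * tau))
    else:
        alpha_k = (k - 1.0) / (k + 2.0)
    V_new = alpha_k * V - np.sqrt(tau) * (grad_f(X) + xi(X))
    X_new = X + np.sqrt(tau) * V_new
    return X_new, V_new
```
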
  23. Learning density from Brownian motion
     - Given the bandwidth $h$, particles $\{X_k^i\}_{i=1}^N$, and a step size $s$, we can compute two particle systems
       $Y_k^i(h) = X_k^i - s\,\xi_k(X_k^i; h)$,   $Z_k^i = X_k^i + \sqrt{2s}\,B^i$,   $i = 1, \ldots, N$,
       where $B^i$ is the standard Brownian motion.
     - Minimize $\mathrm{MMD}(\hat{\rho}_Y, \hat{\rho}_Z)$, the maximum mean discrepancy between the empirical distributions $\hat{\rho}_Y(x) = \sum_{i=1}^N \delta(x - Y_k^i(h))$ and $\hat{\rho}_Z(x) = \sum_{i=1}^N \delta(x - Z_k^i)$.
     - Optimize over $h$, using the bandwidth $h_{k-1}$ from the last iteration as the initialization.

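A sketch of this bandwidth-selection step. The score estimator below is a generic Gaussian kernel-density score (the slides do not spell one out), the MMD uses a Gaussian kernel of fixed width, and a grid search over h stands in for the optimization initialized at $h_{k-1}$; all function names here are ours.

```python
import numpy as np

def kde_score(X, query, h):
    """xi(.; h): gradient of the log of a Gaussian kernel density estimate built on X."""
    diff = query[:, None, :] - X[None, :, :]              # (M, N, d)
    logw = -np.sum(diff ** 2, axis=-1) / (2.0 * h ** 2)   # (M, N)
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum('mn,mnd->md', w, -diff) / h ** 2      # sum_j w_j (X_j - x) / h^2

def mmd2(Y, Z, kernel_h=1.0):
    """Squared MMD between the empirical measures of Y and Z with a Gaussian kernel."""
    def gram(A, B):
        d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2.0 * kernel_h ** 2))
    return gram(Y, Y).mean() + gram(Z, Z).mean() - 2.0 * gram(Y, Z).mean()

def select_bandwidth(X, s, h_grid, rng=None):
    """Pick h so that the score step Y = X - s * xi(X; h) best matches the Brownian
    step Z = X + sqrt(2 s) B in MMD; a grid search replaces the optimization over h."""
    rng = np.random.default_rng() if rng is None else rng
    Z = X + np.sqrt(2.0 * s) * rng.standard_normal(X.shape)
    losses = [mmd2(X - s * kde_score(X, X, h), Z) for h in h_grid]
    return h_grid[int(np.argmin(losses))]
```
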
  24. Adaptive restart in density space
     - The second equation in (W-AIG) is the Hamilton-Jacobi equation, which usually exhibits strong stiffness. We propose an adaptive restart technique to deal with this problem. Consider
       $\varphi_k = \sum_{i=1}^N \big\langle V_{k+1}^i,\ -\nabla f(X_k^i) - \xi_k(X_k^i) \big\rangle$,
       which can be viewed as a discrete-time approximation of $-\partial_t E(\rho_t)$.
     - If $\varphi_k < 0$, then we restart the algorithm with initial values $X_0^i = X_k^i$ and $V_0^i = 0$.
     - This essentially keeps $\partial_t E(\rho_t)$ negative along the trajectory.
     - The numerical results show that the adaptive restart accelerates and stabilizes the discrete-time algorithm.

  25. AIG algorithm
     Algorithm 1: Particle implementation of the Wasserstein AIG flow
     Require: initial positions $\{X_0^i\}_{i=1}^N$, step size $\tau$, number of iterations $L$.
     1: Set $k = 0$, $V_0^i = 0$, $i = 1, \ldots, N$. Set the bandwidth $h_0$ by MED.
     2: for $l = 1, 2, \ldots, L$ do
     3:   Compute $h_l$ based on BM: $h_l = \mathrm{BM}(h_{l-1}, \{X_k^i\}_{i=1}^N, \sqrt{\tau})$.
     4:   Calculate $\xi_k(X_k^i)$ with the bandwidth $h_l$.
     5:   Set $\alpha_k$ based on whether $E(\rho)$ is $\mathrm{Hess}(\beta)$ or $\mathrm{Hess}(0)$. Update
          $V_{k+1}^i = \alpha_k V_k^i - \sqrt{\tau}\,(\nabla f(X_k^i) + \xi_k(X_k^i))$,  $X_{k+1}^i = X_k^i + \sqrt{\tau}\,V_{k+1}^i$.
     6:   if RESTART then
     7:     Compute $\varphi_k = \sum_{i=1}^N \big\langle V_{k+1}^i,\ -\nabla f(X_k^i) - \xi_k(X_k^i) \big\rangle$.
     8:     If $\varphi_k < 0$, set $X_0^i = X_k^i$, $V_0^i = 0$, and $k = 0$; otherwise set $k = k + 1$.
     9:   else
     10:    Set $k = k + 1$.
     11:  end if
     12: end for

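Putting the pieces together, a sketch of Algorithm 1 that reuses the hypothetical helpers kde_score, select_bandwidth, and w_aig_step from the sketches above. The MED initialization of $h_0$ is not reproduced (a fixed bandwidth grid is searched every iteration), and on restart the positions stay at $X_k$ while the velocities and the counter $k$ are reset, following step 8; the illustrative target at the end is ours.

```python
import numpy as np

def wasserstein_aig(grad_f, X0, tau, n_iters, beta=None, restart=True,
                    h_grid=(0.05, 0.1, 0.2, 0.5, 1.0), rng=None):
    """Particle sketch of the Wasserstein AIG flow with adaptive restart (Algorithm 1)."""
    rng = np.random.default_rng() if rng is None else rng
    X, V, k = X0.copy(), np.zeros_like(X0), 0
    for _ in range(n_iters):
        # Bandwidth from the Brownian-motion matching criterion (grid-search stand-in).
        h = select_bandwidth(X, s=np.sqrt(tau), h_grid=np.asarray(h_grid), rng=rng)
        xi = lambda Q, X=X, h=h: kde_score(X, Q, h)
        X_new, V_new = w_aig_step(X, V, grad_f, xi, tau, k + 1, beta=beta)
        if restart:
            # phi_k = sum_i <V_{k+1}^i, -grad f(X_k^i) - xi_k(X_k^i)>, a discrete
            # surrogate for -dE(rho_t)/dt; restart when the energy stops decreasing.
            phi = np.sum(V_new * (-(grad_f(X) + xi(X))))
            if phi < 0:
                V, k = np.zeros_like(V), 0  # keep positions at X_k, reset V and k
                continue
        X, V, k = X_new, V_new, k + 1
    return X

# Illustrative use: target pi ~ exp(-|x|^2 / 2), so grad f(X) = X row-wise.
rng = np.random.default_rng(0)
X_out = wasserstein_aig(lambda X: X, X0=2.0 * rng.standard_normal((300, 2)),
                        tau=0.05, n_iters=200, beta=1.0, rng=rng)
```
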
  26. Gaussian examples: ODE level
     - The initial value is set to $\Sigma^0 = I$.
     - The following figure presents the convergence of the KL divergence on two target distributions with small/large $L$.
     Figure: The acceleration of the W-AIG flow and the effect of adaptive restart (ODE level). The target density is a Gaussian distribution with zero mean on $\mathbb{R}^{100}$. Left: $L = 1$, $\kappa \approx 3.8 \times 10^3$. Right: $\beta = 1$, $\kappa \approx 4.0 \times 10^3$.

  27. Gaussian examples: particle level
     - The initial distribution of samples follows $N(0, I)$ and the number of samples is $N = 600$.
     - For a particle system $\{X_k^i\}_{i=1}^N$, we record the KL divergence $E(\hat{\Sigma}_k)$ using the empirical covariance matrix $\hat{\Sigma}_k$.
     Figure: The acceleration of the W-AIG flow and the effect of adaptive restart (particle level). The setting of the target density is identical to the previous figure.

  28. Bayesian logistic regression
     Figure: Comparison of different methods on Bayesian logistic regression, averaged over 10 independent trials. The shaded areas show the variance over the 10 trials. Top: BM. Bottom: MED. Left: test accuracy; Right: test log-likelihood.

  29. Future works (diagram): complex systems, Bayesian sampling, time discretization, spatial discretization, entropy-entropy flux metric, optimization.