Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Information Gamma Calculus

Wuchen Li
July 30, 2023

Information Gamma Calculus

We study the convergence analysis for general degenerate and non-reversible stochastic differential equations (SDEs). We apply the Lyapunov method to analyze the Fokker-Planck equation, in which the Lyapunov functional is chosen as a weighted relative Fisher information functional. We derive a structure condition and formulate the Lyapunov constant explicitly. We prove the exponential convergence result for the probability density function towards its invariant distribution in the L1 distance. Two examples are presented: underdamped Langevin dynamics with variable diffusion matrices and three oscillator chain models with nearest-neighbor couplings.

Wuchen Li

July 30, 2023
Tweet

More Decks by Wuchen Li

Other Decks in Research

Transcript

  1. Information Gamma calculus: Convexity analysis for stochastic differential equations Wuchen

    Li University of South Carolina July 31. This is based on a joint work with Qi Feng (FSU). Supported by AFOSR YIP award. 1
  2. Sampling problems Main problem Given a function V : Ω

    → R, the problem is to sample from π(x) = 1 Z e−V (x), where x ∈ Ω is a discrete or continuous sampling (state) space, π is a probability density function, and Z is a normalization constant. Difficulties Fast, efficient, accurate algorithms; High dimensional sampling: Ω ⊂ RN , N >> 3. E.g., Image distributions Ω = R64∗64; Function V = − log π − log Z, is often unknown or with intractable formulations. For simplicity of talk, we assume V or π is analytical known. 5
  3. Stochastic differential equations Consider a stochastic differential equation (SDE) by

    ˙ Xt = b(Xt ) + √ 2a(Xt ) ˙ Bt , where (n, m) ∈ N, Xt ∈ Rn+m, b ∈ Rn+m is a drift vector function, a(Xt ) ∈ R(n+m)×n is a diffusion matrix function, and Bt ∈ Rn is a standard n dimensional Brownian motion. For a diffusion matrix a and drift vector field b, assume that it has an invariant measure of shape π, how can we analyze the convergence speed of stochastic dynamics towards its invariant distribution? This is crucial in designing AI type sampling algorithms. 6
  4. Example: Langevin dynamics We review a classical example. Consider a

    overdamped Langevin dynamics in Rn: ˙ Xt = −∇V (Xt ) + √ 2 ˙ Bt , where V is a given potential function. Let ρ(t, x) be the probability density function of Xt , which satisfies the following Fokker-Planck equation ∂t ρ(t, x) = ∇ · (ρ(t, x)∇V (x)) + ∆ρ(t, x). Here π(x) := 1 Z e−V (x), where Z = e−V (y)dy < ∞, is the invariant distribution of the SDE. 8
  5. Lyapunov methods To study the dynamical behavior of ρ, we

    apply a global Lyapunov functional: DKL (ρt π) = ρt (x)log ρt (x) π(x) dx. Along the Fokker-Planck equation, the first order dissipation satisfies d dt DKL (ρt π) = − ∇x log ρt (x) π(x) 2ρt dx. And the second order dissipation satisfies d2 dt2 DKL (ρt π) = 2 ∇2 xx log ρt π 2 F −∇2 xx log π(∇x log ρt π , ∇x log ρt π ) ρt dx, where · F is a matrix Frobenius norm. In literature, DKL is named the Kullback–Leibler divergence (relative entropy, also free energy in statistical physics community) and I = − d dt DKL is called the relative Fisher information functional. 9
  6. Lyapunov constant Suppose there exists a “Lyapunov constant” λ >

    0, such that −∇2 xx log π(x) λI. Then d2 dt2 DKL (ρt π) ≥ −2λ d dt DKL (ρt π). By integrating on the time variable, one can prove the exponential convergence by DKL (ρt π) ≤ e−2λtDKL (ρ0 π). 10
  7. Literature There are several equivalent approaches to establish the Lyapunov

    constant for gradient dynamics. Log-Sobolev inequality (Gross); Iterative Gamma calculus (Bakry, Emery, Baudoin, Garofalo et.al.); Entropy dissipation and Hypocoercivity (Arnold, Carlen, Carrilo, Villani, Mohout, Jungel, Markowich, Toscani, et.al.); Optimal transport, displacement convexity and Hessian operators in density space. (McCann, Ambrosio, Villani, Otto, Gangbo et.al.); Transport Lyapunov functional (Renesse, Strum et.al.). 11
  8. Convergence analysis Recall that ˙ Xt = b(Xt ) +

    √ 2a(Xt ) ˙ Bt , where b can be a non-gradient drift vector, and a is a degenerate matrix. Its Fokker-Planck equation satisfies ∂t ρ(t, x) = −∇ · (ρ(t, x)b(x)) + n+m i=1 n+m j=1 ∂2 ∂xi ∂xj (a(x)a(x)T)ij ρ(t, x) . Assume that there exists an invariant distribution π ∈ C∞(Rn+m) with an explicit formulation. How fast does ρ converge to the invariant distribution π? 12
  9. Goals In this talk, we mainly consider the entropy dissipation

    analysis for non-gradient and degenerate stochastic dynamical systems. Main difficulties. Degeneracy of diffusion matrix; Non-gradient drift vectors. Our analysis is based on the construction of auxiliary second order operators in generalized optimal transport spaces. 13
  10. Optimal transport distances The optimal transport has a variational formulation

    (Benamou-Brenier 2000): D(ρ0, ρ1)2 := inf v 1 0 EXt∼ρt v(t, Xt ) 2 dt, where E is the expectation operator and the infimum runs over all vector fields vt , such that ˙ Xt = v(t, Xt ), X0 ∼ ρ0, X1 ∼ ρ1. Under this metric, the probability set has a metric structure1. 1John D. Lafferty: the density manifold and configuration space quantization, 1988. 14
  11. Optimal transport metrics Informally speaking, the optimal transport metric refers

    to the following bilinear form: ˙ ρ1 , G(ρ) ˙ ρ2 = ( ˙ ρ1 , (−∇ · (ρ∇))−1 ˙ ρ2 )dx. In other words, denote ˙ ρi = −∇ · (ρ∇φi ), i = 1, 2, then φ1 , G(ρ)−1φ2 = (φ1 , −∇ · (ρ∇)φ2 )dx = (∇φ1 , ∇φ2 )ρdx, where ρ ∈ P(Ω), ˙ ρi is the tangent vector in P(Ω), i.e. ˙ ρi dx = 0, and φi ∈ C∞(Ω) are cotangent vectors in P(Ω) at the point ρ. 15
  12. Optimal transport first and second order calculus Given a metric

    space (P, G), there is an optimal transport type calculus method, namely the first order operator (gradient) and the second order operator (Hessian) in the probability density space. They belong to transport information geometric analysis (TIGA). See [Li, TIG, 2018.] Shortly, we design a TIGA method to study non-gradient drift and degenerate diffusion coefficient type stochastic dynamical systems. 16
  13. Optimal transport gradient operators The Wasserstein gradient flow of an

    energy functional F(ρ) leads to ∂t ρ = − G(ρ)−1 δ δρ F(ρ) = ∇ · (ρ∇ δ δρ F(ρ)), where δ δρ is the L2 first variation operator. Example If F(ρ) = DKL (ρ π) = ρ(x)log ρ(x) π(x) dx, then the gradient flow satisfies the gradient-drift Fokker-Planck equation ∂ρ ∂t =∇ · (ρ∇log ρ π ) =∇ · (ρ∇ log ρ) − ∇ · (ρ∇ log π) =∆ρ + ∇ · (ρ∇V ). Here the trick is that ρ∇ log ρ = ∇ρ, ∇ log π = −∇V. 17
  14. First and second order calculus In this way, one can

    study the first order entropy dissipation by d dt DKL (ρt π) = log ρt π ∇ · (ρ∇log ρt π )dx = − ∇log ρt π 2ρdx. Similarly, we study the second order entropy dissipation by d2 dt2 DKL (ρt π) = − 2 Ω HessW DKL (ρt π)(log ρt π , log ρt π )ρt dx = − 2 Ω Γ2 (log ρt π , log ρt π )ρt dx, where HessW DKL (ρt |π) := Γ2 is a bilinear form, which can be defined by the optimal transport second order operator. 18
  15. Convergence analysis of non-gradient flows? Consider a general Fokker-Planck equation

    as ∂t ρ = −∇ · (ρb) + n+m i=1 n+m j=1 ∂2 ∂xi ∂xj (a(x)a(x)T)ij ρ . Suppose there exists a smooth invariant distribution π, such that −∇ · (πb) + n+m i=1 n+m j=1 ∂2 ∂xi ∂xj (a(x)a(x)T)ij π = 0. To study the convergence behavior of ρ towards π, we need to understand and estimate optimal transport type operators on both gradient and non-gradient directions. 19
  16. Decomposition: Gradient and Non-gradient (flux) One can rewrite the Fokker-Planck

    equation by ∂t ρ(t, x) = ∇ · (ρ(t, x)a(x)a(x)T∇ log ρ(t, x) π(x) ) + ∇ · (ρ(t, x)γ(x)), Gradient Non-gradient (Flux) where γ(x) :=a(x)a(x)T∇ log π(x) − b(x) + n+m j=1 ∂ ∂xj (a(x)a(x)T)ij 1≤i≤n+m , and ∇ · (π(x)γ(x)) = 0. This is an example of generalized entropy-entropy flux-metric condition. See examples in [Li-Liu-Osher, Controlling conservation laws I, II], and GENERIC or pre-GENERIC systems [Grmela and Ottinger]. 20
  17. Main result: Structure condition Assumption: for any i ∈ {1,

    · · · , n} and k ∈ {1, · · · , m}, we assume zT k ∇aT i ∈ Span{aT 1 , · · · , aT n }. Examples a is a constant vector; a is a matrix function defined by a = a(x1 , · · · , xn ), z ∈ span{en+1 , · · · , en+m }, where ei is the i-th Euclidean basis function. 21
  18. Main result: Entropy dissipation [F. and Li, 2020-2022] Under the

    assumption, for any β ∈ R and a given vector function z, define matrix functions as R = Ra + Rz + Rπ − MΛ + βRIa + (1 − β)Rγa + Rγz . If there exists a constant λ > 0, such that R λ(aaT + zzT), then the following decay results hold: DKL (ρt π) ≤ 1 2λ e−2λtIa,z (ρ0 π), where ρt is the solution of Fokker-Planck equation. 22
  19. Comparisons (i) If γ = 0 and m = 0:

    [Bakry-Emery, 1985]. (ii) If γ = 0 and m = 0: [Baudoin-Garofalo, 2017], [F.-Li, 2019]. (iii) If β = 0 and m = 0, [Arnold-Carlen-Ju, 2000, 2008]. (iv) If a, z are constants and β = 0, [Arnold-Erb, 2014][Baudoin-Gordina-Herzog, 2019]. (v) If β = 1, m = 0 and a = I, [Arnold-Carlen]; [F.-Li, 2020]. 24
  20. Main idea of proof Step 1: We first compute the

    dissipation of the weighted Fisher information functional (a.k.a. gradient norms in generalized OT space). d dt Ia,z (ρt π) = −2 Γ2,a,z,γ (f, f)ρt dx, where f = log ρt π . Step 2: We next decompose the weak form of information Gamma calculus (a.k.a. Hessian matrices in generalized OT space). Γ2,a,z,γ (f, f)ρt dx = Hessβ f 2 F + R(∇f, ∇f) ρt dx. Step 3: From the information Bochner’s formula, we establish the convergence result. In other words, if R λ(aaT + zzT), then d dt Ia,z (ρt π) ≤ −2λIa,z (ρt π). From Gronwall’s inequality, we can prove that the weighted Fisher information decays, so as the KL divergence decay. 25
  21. Details of proof Define Ia,z (ρ π) = Rn+m ∇

    log ρ π , (aaT + zzT)∇ log ρ π ρdx. Consider − 1 2 d dt Ia,z (ρt ) = Γ2 (f, f)ρt dxdx · · · (I) + Γz,π 2 (f, f)ρt dx · · · (II) + ΓIa,z (f, f)ρt dx · · · (III) where f = log ρ π , and Γ2 , Γz 2 , Γγ are designed bilinear forms, coming from the second order calculus in density space. (i) If a is non-degenerate, then (II) = 0; (ii) If b is a gradient vector field, then (III) = 0. 26
  22. Detailed approach For any f ∈ C∞(Rn+m), the generator of

    Itˆ o SDE satisfies Lf = Lf − γ, ∇f , where Lf = ∇ · (aaT∇f) + aaT∇ log π, ∇f . For a given matrix function a ∈ R(n+m)×n, we construct a matrix function z ∈ R(n+m)×m, and define a z-direction generator by Lz f = ∇ · (zzT∇f) + zzT∇ log π, ∇f . 27
  23. Global computation=Information gamma operators Define Gamma one bilinear forms by

    Γ1 (f, f) = aT∇f, aT∇f Rn , Γz 1 (f, f) = zT∇f, zT∇f Rm . Define Gamma two bilinear forms by (i) Gamma two operator: Γ2 (f, f) = 1 2 LΓ1 (f, f) − Γ1 (Lf, f). (ii) Generalized Gamma z operator: Γz,π 2 (f, f) = 1 2 LΓz 1 (f, f) − Γz 1 (Lf, f) + divπ z Γ1,∇(aaT) (f, f) − divπ a Γ1,∇(zzT) (f, f) . (iii) Irreversible Gamma operator: ΓIa,z (f, f) = (Lf + Lz f) ∇f, γ − 1 2 ∇ Γ1 (f, f) + Γz 1 (f, f) , γ . 28
  24. Local calculation=Information Bochner’s formula For any f = log ρt

    π ∈ C∞(Rn+m, R) and any β ∈ R, under the assumption, we derive − 1 2 d dt Ia,z (ρt π) = Γ2 (f, f) + Γz,π 2 (f, f) + ΓIa,z (f, f) ρt dx = Hessβ f 2 + R(∇f, ∇f) ρt dx ≥ R(∇f, ∇f)ρt dx. Clearly, if R λ(aaT + zzT), then − 1 2 d dt Ia,z (ρt π) ≥ λIa,z (ρt π). From Grownwall’s inequality, we prove the Lyapunov convergence result. 29
  25. Example I: Non-degenerate Langevin dynamics Consider a stochastic process: dXt

    = − a(Xt )a(Xt )T∇V (Xt ) + n j=1 ∂ ∂Xj (a(Xt )a(Xt )T)ij 1≤i≤n − γ(Xt ) dt + √ 2a(Xt )dBt , where Xt ∈ Rn, Bt is a standard n dimensional Brownian motion, V ∈ C2(Rn; R) is a potential function with ∇ · (e−V (x)γ(x)) = 0, and the diffusion matrix a is a smooth diagonal matrix with a(x) = diag(aii (xi ))1≤i≤n . Its invariant density satisfies π(x) = 1 Z e−V (x), Z = e−V (y)dy. 30
  26. Example I: Hessian matrix The Hessian matrix R has the

    following form: R(x) = Ra + βRIa + (1 − β)Rγa (x), where                    Ra,ii = a3 ii ∂xi aii ∂xi V + a4 ii ∂2 xixi V − a3 ii ∂2 xixi aii , Ra,ij = a2 ii a2 jj ∂2 xixj V, RIa,ii = γi [aii ∂xi aii − (aii )2∂xi V ], RIa,ij = 1 2 [γj (2aii ∂xi aii − (aii )2∂xi V ) + γi (2ajj ∂xj ajj − (ajj )2∂xj V )], Rγa,ii = γi aii ∂xi aii − ∂xi γi (aii )2, Rγa,ij = −1 2 [∂xi γj (ajj )2 + ∂xj γi (aii )2]. 31
  27. Example I: One dimensional case Let n = 1 and

    m = 0. R = a3a V + a4V − a3a + βγ(aa − a2V ) + (1 − β)(γaa − a2γ ), where and represent the first and second-order derivatives w.r.t. x. If a = 1 and γ = 0, then R = V (x); If a = 1 and γ = 0, β = 0, then R = V (x) − γ (x); If a = 1 and γ = 0, β = 1, then R = V (x) − γ(x)V (x); If a = 1 and γ = 0, β ∈ R, then R = V (x) − βγ(x)V (x) − (1 − β)γ (x). 32
  28. Example II: Underdamped Langevin dynamics Consider a underdamped Langevin dynamic

    by dxt =vt dt dvt =(−T(xt )vt − ∇x U(xt ))dt + 2T(xt )dBt . It can be viewed as, Yt = (xt , vt ), dYt = b(Yt )dt + √ 2a(Yt )dBt , with drift vectors and diffusion coefficient satisfying b = v −T(x)v − ∇U(x) , a = 0 T(x) . Its invariant measure has a closed form, π(x, v) = 1 Z e−H(x,v), H(x, v) = v 2 2 + U(x). 33
  29. Example II: Decomposition Write a vector field γ ∈ R2

    as γ(x, v) = 0 −rv − v −rv − ∇x U = J∇x,v H(x, v), where J = 0 −1 1 0 . Clearly, ∇x,v · π(x, v)γ(x, v) = 1 Z ∇x,v · e−H(x,v) J∇x,v H(x, v) = 0. 34
  30. II: Constant diffusion -1 -0.5 0 0.5 1 x 1

    -1 -0.5 0 0.5 1 x 2 -1 -0.5 0 0.5 1 1.5 Smallest eignvalue: 0.094 0.095 1 1 0.096 0.097 Smallest eignvalue: 0.098 0.099 0.5 0.5 0.1 x 2 x 1 0 0 -0.5 -0.5 -1 -1 Figure: T=1, U(x) = x2/2; Left β = 0 [Arnold-Erb]; Right β = 0.1; z = (1, 0.1)T. 36
  31. II: Variable diffusion -0.2 -0.15 1 1 -0.1 -0.05 Smallest

    eignvalue: 0.9 0.9 0 0.05 0.8 0.8 x 1 x 2 0.7 0.7 0.6 0.6 0.5 0.5 -0.04 -0.02 1 1 0 0.02 0.04 Smallest eignvalue: 0.06 0.9 0.9 0.08 0.1 0.8 0.8 x 1 x 2 0.7 0.7 0.6 0.6 0.5 0.5 Figure: U(x) = xc−x c(c−1) , T(x) = (∇2 x U(x))−1, c=2.5, z = 1 0.1 . (Left: β = 0; Right: β = 0.6.) 37
  32. Example III: SDEs on Lie groups There are some other

    convergence analysis for SDEs in Lie groups, including the Heisenberg group (quantum SDEs), the displacement group, and the Martinet flat sub-Riemannian structure. Consider dXt = − a(Xt )a(Xt )T∇V (Xt ) + n j=1 ∂ ∂Xj (a(Xt )a(Xt )T)ij 1≤i≤n dt + √ 2a(Xt )dBt , where a(x) ∈ C∞(Rn+m; R(n+m)×n) and π ∈ C∞(Rn+m; R). They are gradient flow examples with degenerate diffusion matrix coefficients. We still manage to derive suitable Hessian matrix conditions. See details in F-L, Entropy dissipation for degenerate stochastic differential equations via sub-Riemannian density manifold, 88 pages, Entropy, 2023. 38
  33. Example III.1: Heisenberg group If R λ(aaT + zzT), then

    DKL (ρt π) ≤ 1 2λ e−2λtIa,z (ρ0 π), where ρt is the solution of Fokker-Planck equation. 39
  34. Example III.2: Displacement group If R λ(aaT + zzT), then

    DKL (ρt π) ≤ 1 2λ e−2λtIa,z (ρ0 π), where ρt is the solution of Fokker-Planck equation. 40
  35. Example IV: Mean field underdamped Langevin dynamics There are some

    other Hessian matrix examples for mean field SDEs, whose Kolmogorov forward equation satisfies the weakly self-consistent Vlasov-Fokker-Planck equation. See analysis of overdamped case in [Carrillo-McCann-Villani, 2003]. Consider dxt = vt dt dvt = −vt dt − ( Td×Rd ∇x W(xt , y)f(t, y, v)dvdy + ∇x U(xt ))dt + √ 2dBt , where (xt , vt ) ∈ Ω = Td × Rd presents an identical particle’s position and velocity, Td is a d dimensional torus representing a position domain, and Bt is a standard Brownian motion in Rd. Here f is the probability density of both spatial and velocity variables (x, v). ∂t f + v · ∇x f − ( Td×Rd ∇x W(x, y)f(t, y, v)dvdy + ∇x U(x)) · ∇v f = ∇v · (fv) + ∇v · (∇v f). 41
  36. Example IV: Decomposition Similarly, we rewrite the above weakly self-consistent

    Vlasov-Fokker-Planck equation in the following equivalent form: ∂t f = ∇x,v · (fγ) + ∇x,v · (faaT∇x,v δ δf E(f)), where E(f) = Ω f(x, v) log f(x, v)dxdv + Ω 1 2 v 2f(x, v)dxdv + 1 2 Ω×Ω W(x, y)f(x, v)f(y, v)dxdvdydv + Ω U(x)f(x, v)dxdv, with γ = J∇x,v [ δ δf E(f)], and J = 0 −Id Id 0 2d×2d . 42
  37. Example IV We apply similar method in this paper and

    derive the Hessian matrix condition for the mean field PDE. See details about choices of W and z with convergence rate λ in B-F-Li, Exponential Entropy dissipation for weakly self-consistent Vlasov-Fokker-Planck equations, arXiv:2204.12049, 2022. 45
  38. TIGA for Stochastic Analysis & Algorithms Optimal choices of auxiliary

    directions for capturing sharp convergence rates; Design fast and scalable AI type sampling algorithms with robust convergence guarantees; Flux-gradient flow functional inequalities; Convergence analysis of generalized flux-gradient flows and Generalized mean field information Gamma calculus. 46