Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Mean field games with applications year 6

Wuchen Li
September 21, 2024
57

Mean field games with applications year 6

Here are recent developments of mean-field games with applications. In particular, we focus on the computation of score functions using regularized Wasserstein proximal operators.

Wuchen Li

September 21, 2024
Tweet

More Decks by Wuchen Li

Transcript

  1. Understand score-based generative models via lens of Wasserstein proximal operators

    Stanley Osher, Siting Liu University of California, Los Angeles Benjamin Zhang, Markos Katsoulakis University of Massachusetts, Amherst Wuchen Li University of South Carolina August 2024
  2. Fundamental mathematical nature of score-based generative models I A fundamental

    characterization of score-based generative models as Wasserstein proximal operators (WPO) of cross-entropy I Mean-field games build a bridge to mathematically equivalent alternative formulations of SGM I Yields explainable formulations of SGMs grounded in theories of information, optimal transport, manifold learning, and optimization I Uncovering mathematical structure of SGMs explains memorization, and informs practical models to generalize better; suggests new practical models with interpretable mathematically-informed structure that train faster with less data. 2
  3. Wasserstein proximal operator I The Wasserstein metric, also known as

    the optimal transport (OT) distance, is a distance function defined between probability distributions and is denoted as Wp(µ, ⌫). I For p = 2, ⌧ > 0, a given probability density µ , the Wasserstein proximal operator (WPO) of the some function V (x) is defined as: ⇢ := WProx⌧V (µ) := arg min q2P2(Rd) ⇢Z Rd V (x)q(x)dx + W2(µ, q)2 2⌧ I Set V (x) = log ⇡(x) of some reference distribution ⇡, the first term is the cross-entropy of ⇡ with respect to ⇢. ⇢ = arg min ⇢ Z Rd q log ⇡dx + W2(µ, q)2 2⌧ µ (source) ! ⇡ (target), redistribution + transport 3
  4. Regularized Wasserstein Proximal Operator I Following Benamou-Brenier formulation of OT,

    the dynamic formulation of the Wasserstein proximal operator is indeed a first order mean-field game system. I The regularization is done by adding viscosity ⇢ through the dynamic formulation of optimal transport: inf ⇢,v ⇢Z 1 0 Z ⌦ 1 2 ⇢(x, t)kv(x, t)k2dx dt + Z ⌦ V (x)⇢(x, 1)dx s.t. ⇢t + r · (⇢v(x, t)) = ⇢, ⇢(x, 0) = µ(x) I Regularized WPO: ⇢ := WProx⌧V, (µ) := arg min nR Rd V (x)q(x)dx + W2(µ,q)2 2⌧ o . I With regulation, we obtain a closed-form formula of ⇢, which allows fast computation.1 I The KKT condition of such PDE-constrained optimization problem gives a second order mean-field game system: ( @U @t + 1 2 |rU|2 = U, U(x, 1) = V (x) @⇢ @t r · (⇢rU) = ⇢, ⇢(x, 0) = µ(x) 1 Wuchen Li, Siting Liu, and Stanley Osher, “A kernel formula for regularized Wasserstein proximal operators.” Research in the Mathematical Sciences 4
  5. Deriving SGM from regularized WPO I Score-based generative models (SGM)

    first perturb data with an SDE, then reverse the SDE for sample generation, where reversing guided by score function s(x, t) = rx log p(x, t). – Forward SDE (data ! noise) dX(t) = dW(t) – Reverse SDE (noise ! data) dY (t) = 2sdt + dW(t) I Set V (x) = log ⇡(x), = 2/2 in regularized WPO, and apply Cole-Hopf transform with a time reparametrization U(x, t) = 2 log ⌘(x, T t): ( @⌘ @s = 2 2 ⌘, ⌘(y, 0) = ⇡(y) @⇢ @t + r · (⇢ 2r log ⌘) = 2 2 ⇢, ⇢(x, 0) = µ(x) I SGM can be summarized with the forward and inverse WPO: ⇡ ⇡ ˜ ⇡ = WProx 2T H, 2/2 (WProx 1 2T H, 2/2 (ˆ ⇡)), where ˆ ⇡ is the empirical distribution defined by samples {Xi }N i=1 ⇠ ⇡. 5
  6. Kernel representation for the score and memorization For initial condition

    ⌘(x, 0) = ⇡(x), the score function s(x, t) = r log ⌘(x, t) has the kernel representation formula s(x, t) = rU(x, t) = 2 (rG 2/2,T t ⇤ ⇡)(x) (G 2/2,T t ⇤ ⇡)(x) where G is the heat kernel. I Memorization: appealing to the forward and inverse Wasserstein proximal notation, using this empirical score score formula is equivalent applying the relation: ˆ ⇡ = WProx 2T H, 2/2 (WProx 1 2T H, 2/2 (ˆ ⇡)), That is, using the exact score function simply returns the training data. I Generalization: Consider a generalization of the empirical distribution by Gaussian kernels. ˆ ⇡✓ (x; {Zi }N i=1 ) = 1 N N X i=1 det ✓ (Zi) (2⇡)d/2 exp ✓ (x Zi)> ✓ (Zi)(x Zi) 2 ◆ , I Learn local covariance matrix ✓ (·) near each kernel center use neural networks. Learning the local precision matrices is manifold learning. 6
  7. A kernel model that generalizes I Enforce the terminal condition

    of HJ equation, which is equivalent to implicit score-matching. I Learning local covariance matrix is akin to manifold learning, which is something SGM has been empirically observed to do. 7
  8. Examples: Deconstructing SGM The 6D example highlights its e↵ectiveness in

    moderate dimensions, showcasing a 3D Swiss roll noisily embedded in 6D space. 9
  9. Example: learn the data manifold Example with 2 moon dataset.

    I Red ellipses denote local covariance matrices I Set of local covariance matrices define Riemannian metric, and therefore a manifold 10
  10. Takeaways I Faster training with less data due to mathematically-informed

    structure of the kernel model, resolving memorization. – Proper choice of kernel (solves HJB equation). – Manifold learning (terminal condition of HJB, proximal interpretation). I Requires no simulation of SDEs: Kernel model can be sampled from directly. I Formulation provides new ideas of implementations – New bespoke neural nets for score-based models for scalable implementations. – Tensors instead of neural networks in manifold learning. 11
  11. Tensor Train Based Sampling Algorithms for Approximating Regularized Wasserstein Proximal

    Operators Fuqun Han (UCLA), Stanley Osher (UCLA), Wuchen Li (USC) August 15, 2024 arXiv:2401.13125
  12. Introduction I Objective: Obtain samples from a known probability distribution

    ⇢⇤(x) = 1 Z exp( V (x)) . I The classical overdamped Langevin dynamics for sampling is dX(t) = rV (X(t)) dt + p 2/ dW(t) , where W(t) is the standard Wiener process (Brownian motion). I We consider the score function based particle evolution equation dX dt = rV (X) 1r log ⇢(t, X) . I To approximate the score function, we utilize the regularized Wasserstein proximal operator (RWPO) ⇢T = argmin q inf v,⇢ Z T 0 Z Rd 1 2 ||v(t, x)||2⇢(t, x) dx dt + Z Rd V (x)q(x) dx , @⇢ @t + r · (⇢v) = 1 ⇢, ⇢(0, x) = ⇢0(x), ⇢(T, x) = q(x). 2
  13. Kernel Formula to Approximate Score Function 1 I Introducing a

    Lagrangian multiplier, the RWPO is equivalent to 8 > < > : @t⇢ + rx · (⇢rx ) = 1 x⇢, @t + 1 2 ||rx ||2 = 1 x , ⇢(0, x) = ⇢0(x), (T, x) = V (x). which has a Fokker-Planck and a Hamilton-Jacobi equation. I With the Hopf-Cole transformation, the RWPO can be solved as ⇢T (x) = Z Rd exp ⇥ 2 V (x) + ||x y||2 2 2T ⇤ R Rd exp ⇥ 2 V (z) + ||z y||2 2 2T ⇤ dz ⇢0(y)dy, where ⇢T approximates the evolution of the Fokker-Planck equation @⇢ @t = r · (⇢rV ) + 1 ⇢, ⇢(0, x) = ⇢0(x). 1W. Li, S. Liu, S. Osher, A kernel formula for regularized Wasserstein proximal operators, 2023 3
  14. Algorithm I With the density function from the kernel formula,

    the score function at time tk + h is approximated by r log ⇢tk+h = r⇢tk+h ⇢tk+h . I The score function at time tk+h allows us to implement the semi-implicit Euler discretization to the score dynamic Xk+1 = Xk h(rV (Xk) + 1r log ⇢tk+h(Xk)). I The high-dimensional function in the kernel formula (density function and heat kernel) is approximated by tensor train f(x1, . . . , xd) ⇡ r1 X ↵1=1 · · · rd X ↵d=1 f1(x1, ↵1) · · · fd(↵d 1, xd). The computational complexity of evaluating the kernel formula is linear with respect to dimension d. 4
  15. Example I: Sampling with Non-Gaussian Distribution We consider a sampling

    task involving a bimodal distribution: ⇢⇤(x) = 1 Z exp( 2(||x|| a)2) ⇥ exp 2(x1 a)2 + exp 2(x1 + a)2 ⇤ . -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 TT-BRWP ULA Figure: Evolution of sample points for di↵erent algorithms after 30 iterations. The contour levels are 0.8, 0.4, and 0.1 in 20 iterations. Faster convergence to the stationary distribution. 5
  16. Example II: Solving Bayesian Inverse Problems Solving nonlinear elliptic inverse

    problems with noisy measurement data by sampling from a posterior distribution. 2 100 200 300 400 500 600 700 800 900 1000 Iterations 0.1 0.2 0.3 0.4 0.5 0.6 0.7 TT-BRWP BRWP MALA 5 10 15 20 25 30 35 40 45 50 55 60 iterations 0 0.2 0.4 0.6 0.8 1 1.2 TT-BRWP BRWP MALA Step size h = 10 3 Step size h = 0.1 Figure: L2 error in parameter estimation with di↵erent step sizes. (BRWP is the algorithm without tensor train integration.) More robust for larger step sizes. 2A. Stuart, Inverse problems: A Bayesian perspective, 2010. 6
  17. Ongoing Works and Future Directions I Convergence and mixing time

    for non-Gaussian log-concave distributions: Utilizing score dynamics enables faster convergence compared to di↵usion generated by Brownian motion. I Optimal parameter selection: Investigating the optimal choice of h and . The semi-backward Euler discretization permits larger and more e↵ective step sizes. I Applications: Exploring applications in areas such as group symmetric distributions, composite density functions, global optimization, and time-reversible di↵usion. 7
  18. A deep learning algorithm for mean field control problems via

    forward-backward score dynamics Mo Zhou, Stan Osher, and Wuchen Li 2024.8 https://arxiv.org/pdf/2401.09547
  19. Mean-field control problem Minimize the cost functional inf v Z

    T 0 Z Rd [L(t, x, v(t, x))⇢(t, x) + F(t, x, ⇢(t, x))] dx dt+ Z Rd V (x)⇢(T, x) dx subject to the Fokker-Planck equation @t⇢(t, x) + rx · (⇢(t, x)v(t, x)) = 1 x⇢(t, x), ⇢(0, x) = ⇢0(x). or stochastic dynamic dXt = v(t, Xt) dt + p 2/ dWt , X0 ⇠ ⇢0. 2
  20. HJB equation that characterize optimality Hamiltonian: Legendre transform of the

    Lagrangian H(t, x, p) = sup v2Rd v > p L(t, x, v). HJB equation: @t (t, x) + H(t, x, rx (t, x)) + 1 x (t, x) = @F @⇢ (t, x, ⇢(t, x)) Optimal velocity: v ⇤(t, x) = DpH(t, x, rx (t, x)) 3
  21. Probability flow: stochastic ! deterministic Capture the stochastic dynamic in

    a deterministic way FP equation: @t⇢(t, x) + rx · (⇢(t, x)v(t, x)) = 1 x⇢(t, x) @t⇢(t, x) + rx ·  ⇢(t, x) ✓ v(t, x) 1 rx log ⇢(t, x) ◆ = 0 stochastic dynamic: dXt = v(t, Xt) dt + p 2/ dWt probability flow: dxt = ✓ v(t, xt) 1 rx log ⇢(t, xt) ◆ dt If x0 follows the distribution ⇢0 , then the density for xt precisely corresponds to the solution of the FP equation. 4
  22. Forward-backward score system and deep learning implementation Ours: Forward-backward score

    ODE system 8 > > > > < > > > > : @txt = DpH(t, xt, zt) 1 rx log ⇢(t, xt), @tyt = L(t, Xt, DpH(t, xt, zt)) + ft 1 ht 1 z > t rx log ⇢(t, xt), x0 ⇠ ⇢0 , yT = V (xT ) where yt = (t, xt), zt = rx (t, xt) and ht = x (t, xt). Discretization error: O( t). Traditional: forward-backward SDE system 8 > < > : dXt = DpH(t, Xt, Zt) dt + p 2/ dWt, dYt = (L(t, Xt, DpH(t, Xt, Zt)) + f(t, Xt, ⇢(t, Xt))) dt + p 2/ Z > t dWt X0 ⇠ ⇢0 , YT = V (XT ). The unique solution to the FBSDE above is Yt = (t, Xt) and Zt = rx (t, Xt). Discretization error: O( p t). 5
  23. Numerical implementation Construct a neural network for (t, x), denoted

    by N (t, x) Loss function: match the dynamic yt and its neural network parametrization N (t, xt) E Z T 0 | N (t, xt) yt| 2 dt, 6
  24. Numerical results inf v Z T 0 Z Rd ✓

    1 2|v(t, x)| 2 + 1 2|x| 2 + log(⇢(t, x)) + ◆ ⇢(t, x) dx dt + Z Rd V (x)⇢(T, x) dx s.t. @t⇢(t, x) + rx · ⇢(t, x)rx (t, x) = x⇢(t, x) Figure: Left: training curve; middle: plot for function; right: density plot. 7
  25. Structured behavior and less error of score dynamic Figure: First

    row: score dynamic for particles. Second row: stochastic dynamic for particles. The score dynamic demonstrates a more structured behavior. errors (1d | 2d) rx r2 x rx r2 x score method 1.59% 0.52% 0.51% 1.82% 0.54% 0.53% BSDE method 2.77% 2.63% 2.36% 1.92% 2.32% 2.14% Future work: E ciently compute the score dynamic. 8
  26. A primal-dual algorithm for reaction-di↵usion equations Shu Liu, Siting Liu,

    Stanley Osher, Xinzhe Zuo, Wuchen Li August, 2024 arXiv links: https://arxiv.org/abs/2305.03945 https://arxiv.org/abs/2401.14602.
  27. Introduction I Goal We propose an easy-to-implement iterative method for

    resolving the time-implicit schemes arising in solving reaction-di↵usion (RD) type equations. I Our treatment – Formulate as a min-max saddle point problem; – Apply the primal-dual hybrid gradient (PDHG) method. – Suitable precondition matrices are applied to accelerate the convergence of algorithms. I The method is applicable to various discrete numerical schemes with high flexibility. I The method can e ciently produce numerical solutions with su cient accuracy. I Extension to adversarial training of Physics Informed Neural Networks (PINNs). E↵ectiveness is verified by numerical results. 2
  28. Formulations Reaction-di↵usion (RD) type equation @tu(x, t) = G(aLu(x, t)

    + bf(u(x, t))), u(·, 0) = u0. Periodic/Neumann boundary condition. • a, b 0, f(u) is nonlinear reaction term. • Allen-Cahn G = Id, L = , f(u) = u3 u; • Cahn-Hilliard G = L = , f(u) = u3 u. Suppose L, G are the space discretization of L, G. The time-implicit scheme Ut Ut 1 ht = G(aLUt + bf(Ut)), 1  t  Nt. reduces to the following time-accumulated nonlinear equation F(U) = 0, where F(U) = DU V + htG (aL U + bf(U)). Here U 2 RNt·Nx , F : RNtNx ! RNtNx , where Nx is the number of space discretization. D denotes time discretization matrix, V 2 RNtNx is constant vector carrying boundary and initial data. G = INt ⌦ G, L = INt ⌦ L. 3
  29. Methods Solving F(U) = 0 is equivalent to minimizing 1

    2✏ kF(U)k2. This is equivalent to solving the saddle point problem min U2RN max P 2RN {P>F(U) ✏ 2 kPk2}. But F(U) could be ill-conditioned. Need preconditioning! I Consider precondition matrix M, e.g. M = D + htG (aL + bf0(U)I). (U denotes the equilibrium state. ±1 for Allen-Cahn or Cahn-Hilliard equations.) I replace F(·) by b F(·) = M 1F(·). (M 1 can be handled with FFT or DCT.) Apply Primal-Dual Hybrid Gradient method to min U max P P> b F(U) ✏ 2 kPk2: Pk+1 = 1 1 + ✏⌧P (Pk + ⌧P (M 1F(Uk))); e Pk+1 = Pk+1 + !(Pk+1 Pk); Uk+1 = Uk ⌧U ((M 1DF(Uk))> e Pk+1). ⌧U , ⌧P are stepsizes of PDHG algorithm, ! is the extrapolation coe cient. 4
  30. Convergence of preconditioned PDHG Suppose f(·) is Lipschitz; Lh Gh

    = Gh Lh . Denote R(U) = f(U) f(U) f0(U)(U U). Assume ht , Nt satisfy I (Allen-Cahn type) Nt · ht < p 2 1 bLip(R) ; I (Cahn-Hilliard type) ht < 4( p 2 1)2a b2(Lip(R) ( p 2 1)c)2 + , Nt  j ( p 2 1) 2 p a/ht+bc bLip(R) k . Then we have: I There exists a unique solution U⇤ to F (U) = 0. I One can pick hyperparameters ⌧P , ⌧U , !, ✏, s.t. Uk converges linearly to U⇤, kUk U⇤k2 2  Constant · 2 + p 2 + 4 ! k+1 , where > 0 is independent of the space discretization. 40 60 80 100 120 140 160 180 200 220 240 260 N x 50 100 150 200 250 300 350 400 450 Iterations needed for convergence Iterations needed for convergence vs N x 6th Order VarCoeff Cahn-Hilliard Allen-Cahn 5
  31. A 6th order reaction-di↵usion equation We consider the 6th-order Cahn-Hilliard-type

    equation depicting pore formation in functionalized polymers [Gavish et al. 2012]. @u(x, t) @t = (✏2 0 W00(u) + ✏2 0 )(✏2 0 u W0(u)) on [0, 2⇡]2, u(·, 0) = u0. Set ✏ = 0.18 with initial condition u0 = 2esin x+sin y 2 + 2.2e sin x sin y 2 1. The equation is imposed with the periodic b.c. When we set up the precondition matrix M, we replace original G = h(✏2 0 h diag(W 00(U)) + ✏2 0 I) by e G = h(✏2 0 h W 00(¯ u)I + ✏2 0 I), where W 00(¯ u) = W 00(±1) = 2. (a) t = 0.0 (b) t = 0.1 (c) t = 2.0 (d) t = 20.0 Figure: Numerical solution at di↵erent time stages. 6
  32. Ongoing research: Apply the preconditioned PDHG method to adversarial training

    of PDEs I General partial di↵erential equation (PDE) Lu = f on ⌦ ⇢ Rd. I infu sup L(u, ) , hLu f, i ✏ 2 k k2. Parameterize u, by neural networks u✓, ⌘. I Natural Primal-Dual Gradient (NPDG) – PDHG with suitable preconditioning. I Comparison with classical methods on a 20-dimensional elliptic equation with variable coe cients. 7
  33. A primal-dual hybrid gradient method for solving optimal control problems

    and the corresponding Hamilton-Jacobi PDEs Tingwei Meng+, Siting Liu+, Wuchen Liú, Stanley Osher+ úDepartment of Mathematics, University of South Carolina +Department of Mathematics, UCLA August 4, 2024 Meng, Liu, Li, Osher PDHG control August 4, 2024 1 / 5
  34. Motivation Solving the optimal control problem min –(·) ;⁄ T

    t L(“(s), s, –(s))ds + g(“(T)): “(t) = x, ˙ “(s) = f (“(s), s, –(s))’s œ (t, T) < , (1) and the corresponding Hamilton-Jacobi (HJ) PDE Y ] [ ˆÏ(x, t) ˆt + sup –œRm {≠Èf (x, T ≠ t, –), Òx Ï(x, t)Í ≠ L(x, T ≠ t, –)} = 0, x œ , t œ [0, T], Ï(x, 0) = g(x), x œ . (2) Grid based method (e.g., Lax–Friedrichs, Engquist-Osher scheme) needs to satisfy the CFL condition in discretization. In this project, we provide an optimization method for solving certain HJ PDEs with a saddle point formulation and implicit time discretization. This allows us to choose a larger time stepsize and avoid CFL condition. In the literature, there are some optimization algorithms for solving HJ PDE [Darbon, Dower, Osher, Yegerov, . . . ]. Compared to these methods, the optimization algorithm proposed in this project can handle a more general Hamiltonian that depends on (x, t). Partial theoretical guarantee is provided. Meng, Liu, Li, Osher PDHG control August 4, 2024 2 / 5
  35. Continuous formula Solving the original problem (1) and (2) is

    equivalent to the following optimization problem min Ï satisfying (2) ≠c ⁄ Ï(x, T)dx, (3) where c œ R is a constant used to stabilize the method. Introducing the Lagrange multiplier fl, we get min Ï Ï(x,0)=g(x) max fl ⁄ T 0 ⁄ fl(x, t) 3 ˆÏ(x, t) ˆt + sup –œRm {≠Èfx,t (–), Òx Ï(x, t)Í ≠ Lx,t (–)} 4 dxdt ≠ c ⁄ Ï(x, T)dx. (4) Under certain conditions, this becomes a saddle point problem min Ï Ï(x,0)=g(x) max flØ0,– ⁄ T 0 ⁄ fl(x, t) 1 ˆÏ(x, t) ˆt ≠ Èfx,t (–(x, t)), Òx Ï(x, t)Í ≠ Lx,t (–(x, t)) 2 dxdt ≠ c ⁄ Ï(x, T)dx. (5) We discretize and then employ the PDHG method to solve this saddle point problem. We use implicit time discretization to avoid the CFL condition. Meng, Liu, Li, Osher PDHG control August 4, 2024 3 / 5
  36. Numerical results: a quadratic Hamiltonian We solve the optimal control

    problem with dynamics f (x, t, –)I = ≠(|xI ≠ 1|2 + 0.1)–I for I = 1, 2 and Lagrangian L(x, t, –) = 1 2 |–|2. Then, the Hamiltonian is H(x, t, p) = 1 2 q 2 I=1 (|xI ≠ 1|2 + 0.1)2 p 2 I . The terminal cost function is g(x) = q 2 I=1 sin fixI , over the spatial domain [0, 2]2 under periodic boundary conditions. Left: solution to the HJ PDE, right: optimal trajectories. (a) (x, y) ‘æ Ï(x, y, t) at di erent t (b) Optimal trajectories “ú Meng, Liu, Li, Osher PDHG control August 4, 2024 4 / 5
  37. Numerical results: a 1-homogeneous Hamiltonian We solve a 1D problem

    with the same f and g as last page. The Lagrangian is an indicator function of the unit ball in the ¸Œ space. The Hamiltonian is H(x, t, p) = (|x ≠ 1|2 + 0.1)|p| (1-homogeneous with respect to p). Top left: solution to HJ PDE; top right: feedback optimal control; bottom left: optimal trajectories; bottom right: optimal controls. Meng, Liu, Li, Osher PDHG control August 4, 2024 5 / 5