
Wasserstein gradient flow of Moreau envelopes of f-divergences in reproducing kernel Hilbert spaces - MML'25 version

Slightly modified slides for the talk given at the conference Mathematics of Machine Learning 2025 (https://www.tuhh.de/dsf/mml) at the TUHH in Hamburg, Germany.


Viktor Stein

October 11, 2025

Transcript

  1. Wasserstein Gradient Flows of Moreau Envelopes of f-Divergences in Reproducing Kernel Hilbert Spaces
     Viktor Stein, TU Berlin
     Joint work with Sebastian Neumayer (TU Chemnitz), Nicolaj Rux (TU Chemnitz / Berlin), and Gabriele Steidl (TU Berlin).
     Conference on Mathematics of Machine Learning 2025, TU Hamburg, 22.09.2025.
  2. Goal. Recover $\nu \in \mathcal{P}(\mathbb{R}^d)$ from samples by minimizing an f-divergence $D_{f,\nu}$ towards $\nu$, e.g. $\mathrm{KL}(\cdot \mid \nu)$ or $\chi^2(\cdot, \nu)$.
     Problem. We only have samples of $\nu$ ⇝ empirical measures, but $\mu \not\ll \nu \implies D_{f,\nu}(\mu) = \infty$.
     Our solution. Regularize $D_{f,\nu} \colon \mathcal{M}(\mathbb{R}^d) \to [0, \infty]$ in two steps:
     1. "Kernel trick": $m \colon \mathcal{M}(\mathbb{R}^d) \to \mathcal{H}_K$, $\mu \mapsto \int_{\mathbb{R}^d} K(x, \cdot) \, \mathrm{d}\mu(x)$, giving the extension "$D_{f,\nu} \circ m^{-1}$" $= G_{f,\nu} \colon \mathcal{H}_K \to [0, \infty]$.
     2. Moreau envelope regularization:
        ${}^{\lambda}G_{f,\nu}(m(\mu)) = \min_{\sigma \in \mathcal{M}(\mathbb{R}^d)} D_{f,\nu}(\sigma) + \frac{1}{2\lambda} \| m(\sigma) - m(\mu) \|_{\mathcal{H}_K}^2, \qquad \lambda > 0.$
     We prove existence & uniqueness of $W_2$ gradient flows of $({}^{\lambda}G_{f,\nu}) \circ m$ and simulate particle flows, i.e. $W_2$ gradient flows starting at an empirical measure.
  3. Concurrent and prior work
     • KALE functional = MMD-regularized KL divergence [Glaser, Arbel, Gretton, NeurIPS'21]. No Moreau envelope interpretation; builds on [Nguyen et al., 2007].
     • Kernel methods of moments = f-divergence-regularized MMD [Kremer, Nemmour, Schölkopf, Zhu, ICML'23]. No gradient flow; added constraints w.r.t. the parameter.
     • (f, Γ)-divergence = Pasch-Hausdorff envelope of f-divergences [Birrell, Dupuis, Katsoulakis, Pantazis, Rey-Bellet, JMLR'23]. Only a Lipschitz, not a differentiable, functional.
     • $W_1$-Moreau envelope of f-divergences [Terjék, ICML'21]. No RKHS making the optimization finite-dimensional, hence tractable.
     • (De)-regularized MMD gradient flow [Chen et al., arXiv 09/24]. MMD-regularized χ²-divergence; asymptotic geodesic convexity.
  4. 1. RKHS & MMD   2. Moreau envelopes   3. f-divergences   4. Moreau envelopes of f-divergences   5. Wasserstein gradient flow (WGF)   6. WGF of Moreau envelopes of f-divergences
  5. Kernel mean embedding and Maximum Mean Discrepancy
     $K \colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ symmetric, positive definite ↭ reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ [Aronszajn, 1950, Steinwart and Christmann, 2008]. We only consider radial kernels $K(x, y) = \varphi(\|x - y\|_2^2)$.
     Kernel mean embedding ("kernel trick" for signed measures in $\mathcal{M}(\mathbb{R}^d)$ instead of points in $\mathbb{R}^d$):
     $m \colon \mathcal{M}(\mathbb{R}^d) \to \mathcal{H}_K$, $\mu \mapsto \int_{\mathbb{R}^d} K(x, \cdot) \, \mathrm{d}\mu(x)$.
     (Diagram: the point embedding $x \mapsto K(x, \cdot)$ factors as $x \mapsto \delta_x \mapsto m(\delta_x)$.)
     Instead of comparing measures, compare their embeddings in $\mathcal{H}_K$: the maximum mean discrepancy (MMD)
     $d_K \colon \mathcal{M}(\mathbb{R}^d) \times \mathcal{M}(\mathbb{R}^d) \to [0, \infty)$, $(\mu, \nu) \mapsto \|m(\mu - \nu)\|_{\mathcal{H}_K}$.
     $m$ injective ⟺ $\mathcal{H}_K \subset C_0(\mathbb{R}^d)$ dense ($\mathcal{H}_K$ "characteristic") ⟺ $d_K$ is a metric [Borgwardt et al., 2006].
     Easy to evaluate, e.g. for discrete measures, since
     $d_K(\mu, \nu)^2 = \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} K(x, y) \, \mathrm{d}(\mu - \nu)(x) \, \mathrm{d}(\mu - \nu)(y)$ for all $\mu, \nu \in \mathcal{M}(\mathbb{R}^d)$.
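     As an illustration of the last formula, here is a minimal NumPy sketch (my own, with a Gaussian kernel and made-up data, not the authors' code) that evaluates $d_K(\mu, \nu)^2$ for two empirical measures by summing the kernel over all pairs of atoms.

```python
import numpy as np

def gaussian_kernel(X, Y, s=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 s)), evaluated on all pairs of rows."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * s))

def mmd_squared(X, Y, s=1.0):
    """d_K(mu, nu)^2 for mu = (1/N) sum_j delta_{x_j}, nu = (1/M) sum_i delta_{y_i}."""
    N, M = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, s).sum() / N**2
    Kyy = gaussian_kernel(Y, Y, s).sum() / M**2
    Kxy = gaussian_kernel(X, Y, s).sum() / (N * M)
    return Kxx + Kyy - 2 * Kxy

# Example: two Gaussian point clouds in R^2
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
Y = rng.normal(0.5, 1.0, size=(300, 2))
print(mmd_squared(X, Y))
```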
  6. Regularization in Convex Analysis - Moreau envelopes
     Let $(\mathcal{H}, \langle \cdot, \cdot \rangle, \| \cdot \|)$ be a Hilbert space and $g \in \Gamma_0(\mathcal{H})$, i.e. $g \colon \mathcal{H} \to (-\infty, \infty]$ convex, lower semicontinuous, $g \not\equiv +\infty$. For $\varepsilon > 0$, the $\varepsilon$-Moreau envelope [Moreau, 1965] of $g$,
     ${}^{\varepsilon}g \colon \mathcal{H} \to \mathbb{R}$, $x \mapsto \min \big\{ g(x') + \tfrac{1}{2\varepsilon} \|x - x'\|^2 : x' \in \mathcal{H} \big\}$,
     is a convex, differentiable regularization of $g$ preserving its minimizers, with ${}^{\varepsilon}g(x) \nearrow g(x)$ as $\varepsilon \searrow 0$.
     Dual formulation [Bauschke and Combettes, 2011]: ${}^{\varepsilon}g(x) = \max_{p \in \mathcal{H}} \langle p, x \rangle - g^*(p) - \tfrac{\varepsilon}{2} \|p\|^2$.
     The convex conjugate of $g$ is $g^* \colon \mathcal{H} \to [-\infty, \infty]$, $y \mapsto \sup \{ \langle x, y \rangle - g(x) : x \in \mathcal{H} \}$.
     (Figure: Moreau envelope of an extended-valued non-differentiable function (top) and of $|\cdot|$ for different $\varepsilon$ (bottom). © Trygve U. Helgaker, Pontus Giselsson)
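     A small numerical sanity check of the definition (my own illustration, not from the slides): for $g = |\cdot|$ on $\mathbb{R}$, the $\varepsilon$-Moreau envelope is the Huber function, and a brute-force minimization over a grid reproduces the closed form.

```python
import numpy as np

def moreau_envelope_abs(x, eps):
    """Closed form of the eps-Moreau envelope of g = |.| (the Huber function)."""
    return np.where(np.abs(x) <= eps, x**2 / (2 * eps), np.abs(x) - eps / 2)

def moreau_envelope_numeric(g, x, eps, grid):
    """Brute-force min over a grid of  g(x') + ||x - x'||^2 / (2 eps)."""
    return np.min(g(grid) + (x - grid) ** 2 / (2 * eps))

grid = np.linspace(-5, 5, 20001)
for x in [-2.0, 0.3, 1.5]:
    for eps in [0.1, 1.0]:
        closed = moreau_envelope_abs(np.array(x), eps)
        numeric = moreau_envelope_numeric(np.abs, x, eps, grid)
        assert abs(closed - numeric) < 1e-4
print("Moreau envelope of |.| matches the Huber function.")
```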
  7. f-divergences - Quantifying the discrepancy between two measures
     Let $f \in \Gamma_0(\mathbb{R})$ with $f|_{(-\infty, 0)} \equiv \infty$, with unique minimizer at 1, $f(1) = 0$, and positive recession constant $f'_\infty := \lim_{t \to \infty} \frac{f(t)}{t} > 0$ [Csiszár, 1964]. Example: $f_{\mathrm{KL}}(x) := x \ln(x) - x + 1$ for $x \geq 0$.
     Definition (f-divergence [Liero et al., 2017]). The f-divergence of $\mu = \rho \nu + \mu_s \in \mathcal{M}_+(\mathbb{R}^d)$ (unique Lebesgue decomposition) to $\nu \in \mathcal{M}_+(\mathbb{R}^d)$ is
     $D_{f,\nu}(\rho \nu + \mu_s) := \int_{\mathbb{R}^d} f \circ \rho \, \mathrm{d}\nu + f'_\infty \cdot \mu_s(\mathbb{R}^d)$ (with $\infty \cdot 0 := 0$)
     $= \sup_{h \in C_b(\mathbb{R}^d),\, h \leq f'_\infty} \mathbb{E}_\mu[h] - \mathbb{E}_\nu[f^* \circ h]$, where $\mathbb{E}_\sigma[h] := \int_{\mathbb{R}^d} h(x) \, \mathrm{d}\sigma(x)$.
     Examples: (reverse) KL divergence, (reverse) χ²-divergence, Jensen-Shannon divergence, Jeffreys divergence, TV metric, Hellinger divergence, hockey-stick divergence, Marton divergence, ...
     $D_{f,\nu}(\mu) = 0 \iff \mu = \nu$. But $D_{f,\nu}$ is not symmetric and does not fulfill the triangle inequality.
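     A minimal sketch (mine, not from the slides) of the defining formula for discrete measures on a common finite alphabet, including the recession term $f'_\infty \cdot \mu_s(\mathbb{R}^d)$ that makes $D_{f,\nu}(\mu) = \infty$ whenever $\mu \not\ll \nu$ and $f'_\infty = \infty$.

```python
import numpy as np

def f_divergence(mu, nu, f, f_inf):
    """D_{f,nu}(mu) for discrete measures mu, nu on a common finite alphabet.

    mu = rho * nu + mu_s, where mu_s is the part of mu living where nu vanishes.
    """
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    support = nu > 0
    rho = np.where(support, mu / np.where(support, nu, 1.0), 0.0)
    absolutely_cont = np.sum(f(rho) * nu, where=support)   # integral of f(rho) d(nu)
    singular_mass = mu[~support].sum()                      # mu_s of the whole space
    return absolutely_cont + (f_inf * singular_mass if singular_mass > 0 else 0.0)

# Kullback-Leibler: f_KL(x) = x ln x - x + 1, with f'_inf = infinity
f_kl = lambda x: np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0)) - x + 1, 1.0)
mu = np.array([0.2, 0.5, 0.3, 0.0])
nu = np.array([0.25, 0.25, 0.25, 0.25])
print(f_divergence(mu, nu, f_kl, np.inf))    # finite KL(mu | nu)

mu2 = np.array([0.2, 0.5, 0.0, 0.3])
nu2 = np.array([0.4, 0.6, 0.0, 0.0])         # mu2 is not absolutely continuous w.r.t. nu2
print(f_divergence(mu2, nu2, f_kl, np.inf))  # = infinity
```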
  8. 1. RKHS & MMD   2. Moreau envelopes   3. f-divergences   4. MMD-Moreau envelopes of f-divergences   5. Wasserstein gradient flow   6. WGF of MMD-Moreau envelopes of f-divergences
  9. MMD-regularized f-divergence - Moreau envelope interpretation
     We define the MMD-regularized f-divergence functional
     $D^\lambda_{f,\nu}(\mu) := \min \big\{ D_{f,\nu}(\sigma) + \tfrac{1}{2\lambda} d_K(\mu, \sigma)^2 : \sigma \in \mathcal{M}(\mathbb{R}^d) \big\}, \qquad \lambda > 0, \; \mu \in \mathcal{M}(\mathbb{R}^d)$.
     Theorem (Moreau envelope identification of $D^\lambda_{f,\nu}$ [SNRS25]). The $\mathcal{H}_K$-extension of $D_{f,\nu}$,
     $G_{f,\nu} \colon \mathcal{H}_K \to [0, \infty]$, $h \mapsto D_{f,\nu}(\mu)$ if there exists $\mu \in \mathcal{M}_+(\mathbb{R}^d)$ with $h = m(\mu)$, and $h \mapsto \infty$ otherwise,
     is convex and lower semicontinuous (like $D_{f,\nu}$), and ${}^{\lambda}G_{f,\nu} \circ m = D^\lambda_{f,\nu}$.
     (Diagram: the square formed by $m$, $G_{f,\nu}$, ${}^{\lambda}G_{f,\nu}$, $D_{f,\nu}$ and $D^\lambda_{f,\nu}$ commutes.)
  10. Properties of $D^\lambda_{f,\nu}$ [SNRS25]
     • Dual formulation: $D^\lambda_{f,\nu}(\mu) = \max \big\{ \mathbb{E}_\mu[p] - \mathbb{E}_\nu[f^* \circ p] - \tfrac{\lambda}{2} \|p\|^2_{\mathcal{H}_K} : p \in \mathcal{H}_K, \; p \leq f'_\infty \big\}$.
     • $\nabla D^\lambda_{f,\nu}(\mu)$ equals the argmax in the dual problem, and $\mu \mapsto \nabla D^\lambda_{f,\nu}(\mu)$ is $\tfrac{1}{\lambda}$-Lipschitz w.r.t. $d_K$.
     • $D^\lambda_{f,\nu} \xrightarrow{\Gamma} D_{f,\nu}$ as $\lambda \searrow 0$ and $(1 + \lambda) D^\lambda_{f,\nu} \to \tfrac{1}{2} d_K(\cdot, \nu)^2$ as $\lambda \to \infty$.
     • Divergence property preserved: $D^\lambda_{f,\nu}(\mu) = 0 \iff \mu = \nu$ ⇝ sampling makes sense.
     • $(\mu, \nu) \mapsto D^\lambda_{f,\nu}(\mu)$ "metrizes" weak convergence (like $d_K$) on $\mathcal{M}_+(\mathbb{R}^d)$-balls.
  11. Wasserstein space and Wasserstein gradient flows
     The Wasserstein-2 space is [Villani, 2008] $\mathcal{P}_2(\mathbb{R}^d) := \big\{ \mu \in \mathcal{P}(\mathbb{R}^d) : \int_{\mathbb{R}^d} \|x\|_2^2 \, \mathrm{d}\mu(x) < \infty \big\}$. The squared distance of $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$ is
     $W_2(\mu, \nu)^2 = \min_{\pi \in \Gamma(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|_2^2 \, \mathrm{d}\pi(x, y)$.
     (Figure: vertical ($L^2$) vs. horizontal ($W_2$) mass displacement. © A. Korba)
     Definition (Wasserstein gradient flow [Ambrosio et al., 2008]). A locally absolutely continuous curve $\gamma \colon (0, \infty) \to \mathcal{P}_2(\mathbb{R}^d)$ with velocity field $v_t \in T_{\gamma_t} \mathcal{P}_2(\mathbb{R}^d)$ is a Wasserstein gradient flow with respect to $\mathcal{F} \colon \mathcal{P}_2(\mathbb{R}^d) \to (-\infty, \infty]$ if
     $v_t \in -\partial_{W_2} \mathcal{F}(\gamma_t)$ for a.e. $t > 0$, where $\partial_{W_2}$ is the Wasserstein subdifferential.
     (Figure: Wasserstein gradient flow. © Petr Mokrov)
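     For two empirical measures with the same number of equally weighted atoms, the optimal coupling is a permutation, so $W_2^2$ reduces to a linear assignment problem; the following sketch (my own, not part of the talk) uses SciPy's assignment solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_squared_empirical(X, Y):
    """W_2(mu, nu)^2 for mu = (1/N) sum_j delta_{x_j}, nu = (1/N) sum_j delta_{y_j}.

    With equally many, equally weighted atoms the optimal coupling is a
    permutation, so the OT problem becomes a linear assignment problem.
    """
    cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(100, 2))
Y = rng.normal(2.0, 1.0, size=(100, 2))
print(w2_squared_empirical(X, Y))  # dominated by the squared mean shift, roughly 8 here
```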
  12. Wasserstein gradient flow with respect to $D^\lambda_{f,\nu}$
     Theorem (Convexity and gradient of $D^\lambda_{f,\nu}$ [SNRS25]). Since $K$ is radial and smooth, $D^\lambda_{f,\nu}$ is $M$-geodesically convex with $M := -8 \lambda^{-1} (d + 2) \varphi''(0) \varphi(0)$, and its subdifferential is
     $\partial D^\lambda_{f,\nu}(\mu) = \{ \nabla \hat{p} \}$, where $\hat{p}$ is the argmax in the dual formulation of $D^\lambda_{f,\nu}(\mu)$.
     $M < 0 \implies D^\lambda_{f,\nu}$ is "less than" convex.
     Corollary. There exists a unique Wasserstein gradient flow $(\gamma_t)_{t > 0}$ of $D^\lambda_{f,\nu}$ starting at $\mu_0 \in \mathcal{P}_2(\mathbb{R}^d)$, fulfilling the continuity equation
     $\partial_t \gamma_t = \nabla \cdot \big( \gamma_t \, \partial D^\lambda_{f,\nu}(\gamma_t) \big), \qquad \gamma_0 = \mu_0$.
     Lemma (Particle flows are $W_2$ gradient flows [SNRS25]). If $\mu_0 = \frac{1}{N} \sum_{j=1}^N \delta_{x_j}$ is empirical, then so is $\gamma_t$ for all $t > 0$.
  13. Numerical experiments - particle descent algorithm
     Take i.i.d. samples $(x^{(0)}_j)_{j=1}^N \sim \mu_0$ and $(y_j)_{j=1}^M \sim \nu$. Forward Euler discretization in time, in Wasserstein space, with step size $\tau > 0$:
     $\gamma_{n+1} := (\mathrm{id} - \tau \nabla \hat{p}_n)_{\#} \gamma_n$, where $\hat{p}_n$ is the argmax in the dual problem for $D^\lambda_{f,\nu}(\gamma_n)$,
     so $\gamma_n = \frac{1}{N} \sum_{j=1}^N \delta_{x^{(n)}_j}$ with gradient step
     $x^{(n+1)}_j = x^{(n)}_j - \tau \nabla \hat{p}_n\big(x^{(n)}_j\big), \qquad j \in \{1, \dots, N\}, \; n \in \mathbb{N}$.
     Theorem (Representer-type theorem [SNRS25]). If $f'_\infty = \infty$ or if $\lambda > \frac{2 d_K(\gamma_n, \nu) \sqrt{\varphi(0)}}{f'_\infty}$, then $\hat{p}_n$ solves a finite-dimensional strongly convex problem.
     To find $\hat{p}_n$, we use L-BFGS-B, a quasi-Newton method, or FISTA on the primal problem. We use annealing w.r.t. $\lambda$ if $f'_\infty < \infty$. A sketch of one step of this scheme follows below.
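     The following is a minimal, self-contained sketch of one outer iteration of this scheme (not the authors' implementation, which is linked on the closing slide), assuming the KL entropy function, so $f^*(y) = e^y - 1$ and $f'_\infty = \infty$, which makes the constraint $p \leq f'_\infty$ vacuous. In the spirit of the representer-type theorem, $\hat{p}_n$ is sought in the span of kernel functions centred at the current particles and the target samples; the Gaussian kernel, all parameter values, and all names are my own choices.

```python
import numpy as np
from scipy.optimize import minimize

def gauss(X, Z, s=1.0):
    """Gaussian kernel matrix K(x_i, z_k) = exp(-||x_i - z_k||^2 / (2 s))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s))

def particle_step(X, Y, lam=1e-1, tau=1e-2, s=1.0):
    """One forward-Euler step of the particle flow for the MMD-regularized KL.

    X: (N, d) current particles (gamma_n),  Y: (M, d) target samples (nu).
    Dual objective (f = f_KL, so f*(y) = e^y - 1):
        J(p) = mean_j p(x_j) - mean_i (exp(p(y_i)) - 1) - (lam/2) ||p||_{H_K}^2,
    maximized over p = sum_k alpha_k K(z_k, .), with z = particles and samples.
    """
    N, M = len(X), len(Y)
    Z = np.concatenate([X, Y])                     # kernel centres
    Kxz, Kyz, Kzz = gauss(X, Z, s), gauss(Y, Z, s), gauss(Z, Z, s)

    def neg_dual(alpha):
        px, py = Kxz @ alpha, Kyz @ alpha          # p at particles / target samples
        J = px.mean() - (np.exp(py) - 1).mean() - 0.5 * lam * alpha @ Kzz @ alpha
        grad = Kxz.T @ np.full(N, 1 / N) - Kyz.T @ (np.exp(py) / M) - lam * Kzz @ alpha
        return -J, -grad

    alpha = minimize(neg_dual, np.zeros(N + M), jac=True, method="L-BFGS-B").x

    # nabla p_hat(x) = sum_k alpha_k * grad_x K(z_k, x) for the Gaussian kernel
    diff = X[:, None, :] - Z[None, :, :]                               # (N, N+M, d)
    grad_p = (Kxz[:, :, None] * (-diff / s) * alpha[None, :, None]).sum(1)
    return X - tau * grad_p                                            # explicit Euler update

# Toy run: move a Gaussian blob towards a shifted target
rng = np.random.default_rng(0)
X = rng.normal(0.0, 0.5, size=(50, 2))
Y = rng.normal(2.0, 0.5, size=(80, 2))
for _ in range(200):
    X = particle_step(X, Y)
print("particle mean after flow:", X.mean(0))  # should drift towards roughly (2, 2)
```

     Swapping in another entropy function only changes $f^*$ inside `neg_dual` (plus the constraint $p \leq f'_\infty$ when it is finite), and the annealing mentioned above would decrease `lam` over the outer iterations.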
  14. Numerical experiments
     Fig. 1: IMQ kernel $K(x, y) = (\sigma^2 + \|x - y\|_2^2)^{-1/2}$, $\lambda = 10^{-2}$, $\tau = 10^{-3}$, Tsallis-3 divergence $f_3(x) = \frac{1}{2}(x^3 - 3x + 2)$ for $x \geq 0$.
     Fig. 2: The number of starting particles $N$ is less than the number of target samples $M$ ⇝ quantization.
  15. Annealing vs. no annealing
     Fig. 3 (snapshots at t = 0, 1, 10, 50, 100): WGF of the regularized $\frac{1}{2}$-Tsallis divergence $D^\lambda_{f_{1/2},\nu}$ without (top) and with annealing (bottom), where $\nu$ is the three-rings target.
  16. Effect of the kernel bandwidth
     (Figure: flows for bandwidths $10^{-3}, 10^{-2}, 10^{-1}, 1, 10$, with snapshots at t = 0, 0.1, 1, 3, 10, 25.)
  17. Further work
     • Non-differentiable (e.g. Laplace) and unbounded (e.g. Riesz, Coulomb) kernels.
     • Regularize other divergences, e.g. Rényi divergences, Bregman divergences, restricted f-divergences, other exponents than p = 2.
     • Different domains, e.g. compact subsets of $\mathbb{R}^d$, groups, locally compact infinite-dimensional spaces.
     • Convergence rates in a suitable metric.
     • Consistency bounds and better M-convexity estimates.
  18. Conclusion
     • Novel objective $D^\lambda_{f,\nu}$: minimizing it allows sampling from a target measure of which only samples are known.
     • Clear, rigorous interpretation using convex analysis and RKHS theory.
     • The theory covers (almost) all f-divergences ⇝ allows different geometries.
     • Best of both worlds: $D^\lambda_{f,\nu}$ interpolates between $D_{f,\nu}$ and $\frac{1}{2} d_K(\cdot, \nu)^2$.
     • Effective algorithms due to a (modified) representer theorem & GPU / PyTorch.
  19. Thank you for your attention! I am happy to take any questions.
     Paper: arxiv.org/abs/2402.04613
     Code: github.com/ViktorAJStein/Regularized_f_Divergence_Particle_Flows
     My website: viktorajstein.github.io
     References cited: [Ambrosio et al., 2008, Birrell et al., 2022, Glaser et al., 2021, Hertrich et al., 2024, Kremer et al., 2023, Leclerc et al., 2020, Liero et al., 2017, Terjék, 2021, Nguyen et al., 2007, Borgwardt et al., 2006]
  20. References I
     [Ambrosio et al., 2008] Ambrosio, L., Gigli, N., and Savaré, G. (2008). Gradient Flows: in Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2nd edition.
     [Aronszajn, 1950] Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc., 68(3):337–404.
     [Bauschke and Combettes, 2011] Bauschke, H. and Combettes, P. (2011). Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics. Springer, New York.
     [Birrell et al., 2022] Birrell, J., Dupuis, P., Katsoulakis, M. A., Pantazis, Y., and Rey-Bellet, L. (2022). (f, Γ)-divergences: Interpolating between f-divergences and integral probability metrics. J. Mach. Learn. Res., 23(39):1–70.
  21. References II
     [Borgwardt et al., 2006] Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Schölkopf, B., and Smola, A. J. (2006). Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57.
     [Csiszár, 1964] Csiszár, I. (1964). Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl., 8:85–108.
     [Glaser et al., 2021] Glaser, P., Arbel, M., and Gretton, A. (2021). KALE flow: A relaxed KL gradient flow for probabilities with disjoint support. In Advances in Neural Information Processing Systems, volume 34, pages 8018–8031, virtual event.
  22. References III
     [Hertrich et al., 2024] Hertrich, J., Wald, C., Altekrüger, F., and Hagemann, P. (2024). Generative sliced MMD flows with Riesz kernels. In International Conference on Learning Representations (ICLR), Vienna, Austria.
     [Kremer et al., 2023] Kremer, H., Nemmour, Y., Schölkopf, B., and Zhu, J.-J. (2023). Estimation beyond data reweighting: kernel methods of moments. In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202, pages 17745–17783, Honolulu, Hawaii, USA.
     [Leclerc et al., 2020] Leclerc, H., Mérigot, Q., Santambrogio, F., and Stra, F. (2020). Lagrangian discretization of crowd motion and linear diffusion. SIAM J. Numer. Anal., 58(4):2093–2118.
  23. References IV
     [Liero et al., 2017] Liero, M., Mielke, A., and Savaré, G. (2017). Optimal entropy-transport problems and a new Hellinger–Kantorovich distance between positive measures. Invent. Math., 211(3):969–1117.
     [Moreau, 1965] Moreau, J.-J. (1965). Proximité et dualité dans un espace Hilbertien. Bulletin de la Société Mathématique de France, 93:273–299.
     [Nguyen et al., 2007] Nguyen, X., Wainwright, M. J., and Jordan, M. (2007). Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc.
  24. References V
     [Steinwart and Christmann, 2008] Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer Science & Business Media.
     [Terjék, 2021] Terjék, D. (2021). Moreau-Yosida f-divergences. In International Conference on Machine Learning (ICML), pages 10214–10224, virtual event. PMLR.
     [Villani, 2008] Villani, C. (2008). Optimal Transport: Old and New, volume 338 of Grundlehren der mathematischen Wissenschaften. Springer, Berlin Heidelberg, 1st edition.
  25. Generalized geodesics in Wasserstein space
     Definition (Generalized geodesic convexity [Ambrosio et al., 2008]). A function $\mathcal{F} \colon \mathcal{P}_2(\mathbb{R}^d) \to (-\infty, \infty]$ is $M$-convex along generalized geodesics with $M \in \mathbb{R}$ if, for every $\sigma, \mu, \nu \in \operatorname{dom}(\mathcal{F})$, there exists $\alpha \in \mathcal{P}_2(\mathbb{R}^{3d})$ with $(P_{1,2})_{\#} \alpha \in \Gamma^{\mathrm{opt}}(\sigma, \mu)$ and $(P_{1,3})_{\#} \alpha \in \Gamma^{\mathrm{opt}}(\sigma, \nu)$ such that, for all $t \in [0, 1]$,
     $\mathcal{F}\big( ((1-t) P_2 + t P_3)_{\#} \alpha \big) \leq (1-t)\, \mathcal{F}(\mu) + t\, \mathcal{F}(\nu) - \frac{M}{2}\, t (1-t) \int_{\mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^d} \|y - z\|_2^2 \, \mathrm{d}\alpha(x, y, z)$.
     Fig. 5: Generalized geodesic from $\mu_2$ to $\mu_3$ with base $\mu_1$ [Ambrosio et al., 2008].
  26. Absolute continuity and the Fréchet subdifferential [Ambrosio et al., 2008]
     A curve $\gamma \colon (0, \infty) \to \mathcal{P}_2(\mathbb{R}^d)$ is absolutely continuous if there exists an $L^2$-Borel velocity field $v \colon \mathbb{R}^d \times (0, \infty) \to \mathbb{R}^d$ such that the continuity equation
     $\partial_t \gamma_t + \nabla \cdot (v_t \gamma_t) = 0, \qquad (t, x) \in (0, \infty) \times \mathbb{R}^d$,
     holds weakly.
     Definition (Fréchet subdifferential in Wasserstein space). The (reduced) Fréchet subdifferential of $\mathcal{F} \colon \mathcal{P}_2(\mathbb{R}^d) \to (-\infty, \infty]$ at $\mu \in \operatorname{dom}(\mathcal{F})$ is
     $\partial \mathcal{F}(\mu) := \Big\{ \xi \in L^2(\mathbb{R}^d; \mu) : \mathcal{F}(\nu) - \mathcal{F}(\mu) \geq \inf_{\pi \in \Gamma^{\mathrm{opt}}(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \langle \xi(x), y - x \rangle \, \mathrm{d}\pi(x, y) + o(W_2(\mu, \nu)) \Big\}$.
  27. Entropy functions
     Examples. $f_{\mathrm{KL}}(x) := x \ln(x) - x + 1$ for $x \geq 0$ yields the Kullback-Leibler divergence, and $f_\alpha(x) := \frac{1}{\alpha - 1}(x^\alpha - \alpha x + \alpha - 1)$ yields the Tsallis-α divergence $T_\alpha$ for $\alpha > 0$. In the limit, $T_1 = \mathrm{KL}$.
     (Left figure: example entropy functions, except the red one: $x \ln(x) - x + 1$, $|x - 1|$, $(x - 1)\ln(x)$, $x \ln(x) - (x + 1)\ln\frac{x+1}{2}$, $\max(0, 1 - x)^2$. Right figure: the functions $f_\alpha$ for $\alpha \in [0.1, 2.5]$.)
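     A short check (my own) that the Tsallis entropy functions above recover $f_{\mathrm{KL}}$ in the limit $\alpha \to 1$ and vanish at their minimizer $x = 1$; for $\alpha = 3$ this is the $f_3$ used in Fig. 1.

```python
import numpy as np

def f_tsallis(x, alpha):
    """Tsallis-alpha entropy function f_alpha(x) = (x^alpha - alpha*x + alpha - 1) / (alpha - 1)."""
    return (x**alpha - alpha * x + alpha - 1) / (alpha - 1)

def f_kl(x):
    """KL entropy function f_KL(x) = x ln x - x + 1 (limit of f_alpha as alpha -> 1)."""
    return x * np.log(x) - x + 1

x = np.linspace(0.1, 3.0, 30)
print(np.max(np.abs(f_tsallis(x, 1.001) - f_kl(x))))  # small: f_alpha -> f_KL as alpha -> 1
print(f_tsallis(np.array([1.0]), 3.0))                # f_3(1) = 0, the unique minimizer
```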
  28. Interlude: regularizing tight f-divergences
     Let $f'_\infty = \infty$. Consider the restricted objective [Nguyen et al., 2007, Terjék, 2021]
     $\tilde{D}_{f,\nu} \colon \mathcal{M}(\mathbb{R}^d) \to [0, \infty]$, $\mu \mapsto D_{f,\nu}(\mu) + \iota_{\mathcal{P}(\mathbb{R}^d)}(\mu)$.
     Then for all $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$ we have
     $D_{f,\nu}(\mu) = \sup_{g \in C_b(\mathbb{R}^d)} \big\{ \mathbb{E}_\mu[g] - (\tilde{D}_{f,\nu})^*(g) \big\}$,  (1)
     where $(\tilde{D}_{f,\nu})^*(g) = \inf \big\{ \int_{\mathbb{R}^d} f^*(g(x) + \theta) \, \mathrm{d}\nu(x) - \theta : \theta \in \mathbb{R} \big\}$.
     Idea: as before, restrict (1) to $\mathcal{H}_K$ and add the regularizer $-\frac{\lambda}{2} \| \cdot \|^2_{\mathcal{H}_K}$:
     $\tilde{D}^\lambda_{f,\nu}(\mu) := \max_{h \in \mathcal{H}_K} \mathbb{E}_\mu[h] - (\tilde{D}_{f,\nu})^*(h) - \frac{\lambda}{2} \|h\|^2_{\mathcal{H}_K}, \qquad \mu \in \mathcal{P}(\mathbb{R}^d)$.
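     For the KL entropy function and a discrete $\nu$, the infimum over $\theta$ in $(\tilde{D}_{f,\nu})^*$ can be checked numerically against its closed form $\ln \mathbb{E}_\nu[e^g]$, the log-partition term of the Donsker-Varadhan formula; this is my own illustration under those assumptions, not from the slides.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def tight_conjugate_kl(g, q):
    """(D~_{f,nu})^*(g) = inf_theta  E_nu[f^*(g + theta)] - theta  for f = f_KL, discrete nu = q."""
    f_star = lambda y: np.exp(y) - 1.0                # convex conjugate of f_KL
    objective = lambda theta: np.sum(q * f_star(g + theta)) - theta
    return minimize_scalar(objective).fun

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(5))                         # discrete target nu
g = rng.normal(size=5)                                # a candidate dual potential

numeric = tight_conjugate_kl(g, q)
closed_form = np.log(np.sum(q * np.exp(g)))           # log-partition term of Donsker-Varadhan
print(numeric, closed_form)                           # agree up to solver tolerance
```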
  29. Theorem (Regularized tight f-divergence [SNRS25])
     The function $\tilde{G}_{f,\nu} \colon \mathcal{H}_K \to [0, \infty]$, $h \mapsto D_{f,\nu}(\mu)$ if there exists $\mu \in \mathcal{P}(\mathbb{R}^d)$ such that $h = m(\mu)$, and $h \mapsto \infty$ otherwise,
     is lower semicontinuous and $\tilde{G}_{f,\nu} \in \Gamma_0(\mathcal{H}_K)$. We have $\tilde{D}_{f,\nu} = \tilde{G}_{f,\nu} \circ m$ and $\tilde{D}^\lambda_{f,\nu} = {}^{\lambda}\tilde{G}_{f,\nu} \circ m$. The primal formulation is
     $\tilde{D}^\lambda_{f,\nu}(\mu) = \min_{\sigma \in \mathcal{P}(\mathbb{R}^d)} D_{f,\nu}(\sigma) + \frac{1}{2\lambda} d_K(\sigma, \mu)^2$.
     The same gradient estimates & primal-dual relationship hold.
  30. WGF of regularized tight f-divergences
     Fig. 6 (snapshots at t = 0.1, 1, 10, 50): WGF of the tight version $\tilde{D}^\lambda_{f,\nu}$ (top) and the non-tight one $D^\lambda_{f,\nu}$ (bottom).
  31. Reproducing Kernel Hilbert Spaces
     "Kernel trick": embed data into a high-dimensional Hilbert space. We require $\varphi \in C^\infty((0, \infty)) \cap C^2([0, \infty))$ with $(-1)^k \varphi^{(k)}(r) \geq 0$ for all $k \in \mathbb{N}$, $r > 0$
     ⇝ reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K := \overline{\operatorname{span}}(\{ K(x, \cdot) : x \in \mathbb{R}^d \})$. Key property: the point evaluations $h \mapsto h(x)$ are continuous.
     Examples (with parameter $s > 0$): Gaussian $\varphi(r) = \exp(-\frac{1}{2s} r)$; inverse multiquadric $\varphi(r) = \sqrt{s}\,(s + r)^{-1/2}$; spline $\varphi(r) = (1 - \sqrt{r})^3_+$.
     Non-examples:
     • Laplace $\varphi(r) = \exp(-\frac{1}{2s} \sqrt{r})$ (not smooth enough),
     • $K(x, y) = \|x\| + \|y\| - \|x - y\|$ (not radial).
     ("Kernel trick" illustration source: songcy.net/posts/story-of-basis-and-kernel-part-2/)
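     A small sketch (mine; the normalization of the inverse multiquadric profile is my reading of the slide) defining the three radial profiles above and confirming numerically that their Gram matrices on random data are positive semidefinite.

```python
import numpy as np

# Radial kernel profiles phi (parameter s > 0), so that K(x, y) = phi(||x - y||_2^2)
def phi_gauss(r, s=1.0):   return np.exp(-r / (2 * s))
def phi_imq(r, s=1.0):     return np.sqrt(s) / np.sqrt(s + r)        # inverse multiquadric
def phi_spline(r):         return np.maximum(1 - np.sqrt(r), 0) ** 3

def kernel_matrix(phi, X):
    """Gram matrix K_ij = phi(||x_i - x_j||_2^2)."""
    r = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return phi(r)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
for name, phi in [("Gaussian", phi_gauss), ("IMQ", phi_imq), ("spline", phi_spline)]:
    eig_min = np.linalg.eigvalsh(kernel_matrix(phi, X)).min()
    print(f"{name}: smallest Gram eigenvalue = {eig_min:.2e}")  # >= 0 up to round-off
```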
  32. Shameless plug: other works
     Interpolating between OT and KL-regularized OT using Rényi divergences. The Rényi divergence (neither an f-divergence nor a Bregman divergence), for $\alpha \in (0, 1)$,
     $R_\alpha(\mu \mid \nu) := \frac{1}{\alpha - 1} \ln \int_X \Big( \frac{\mathrm{d}\mu}{\mathrm{d}\tau} \Big)^{\alpha} \Big( \frac{\mathrm{d}\nu}{\mathrm{d}\tau} \Big)^{1 - \alpha} \mathrm{d}\tau$,
     defines $\mathrm{OT}_{\varepsilon,\alpha}(\mu, \nu) := \min_{\pi \in \Pi(\mu, \nu)} \langle c, \pi \rangle + \varepsilon R_\alpha(\pi \mid \mu \otimes \nu)$, which is a metric, where $\varepsilon > 0$, $\mu, \nu \in \mathcal{P}(X)$, $X$ compact, and
     $\mathrm{OT}(\mu, \nu) \xleftarrow{\;\alpha \searrow 0 \text{ or } \varepsilon \to 0\;} \mathrm{OT}_{\varepsilon,\alpha}(\mu, \nu) \xrightarrow{\;\alpha \nearrow 1\;} \mathrm{OT}^{\mathrm{KL}}_\varepsilon(\mu, \nu)$.
     In the works: the debiased Rényi-Sinkhorn divergence $\mathrm{OT}_{\varepsilon,\alpha}(\mu, \nu) - \frac{1}{2} \mathrm{OT}_{\varepsilon,\alpha}(\mu, \mu) - \frac{1}{2} \mathrm{OT}_{\varepsilon,\alpha}(\nu, \nu)$.
     $W_2$ gradient flows of $d_K(\cdot, \nu)^2$ with $K(x, y) := -|x - y|$ in 1D: reformulation as a maximal monotone inclusion Cauchy problem in $L^2(0, 1)$ via quantile functions; comprehensive description of the solutions' behaviour; instantaneous measure-to-$L^\infty$ regularization; the implicit Euler scheme is simple.
     (Figures: initial measure $\mu_0$; iteration 0 of the explicit and implicit schemes against the initial and target measures.)
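     For discrete measures, the Rényi divergence above is a finite sum; the following snippet (my own illustration, unrelated to the linked code) also shows that $R_\alpha$ increases towards $\mathrm{KL}$ as $\alpha \nearrow 1$.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """R_alpha(p | q) = 1/(alpha - 1) * ln( sum_i p_i^alpha * q_i^(1 - alpha) ), alpha in (0, 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
for alpha in [0.1, 0.5, 0.9, 0.999]:
    print(alpha, renyi_divergence(p, q, alpha))  # increases towards KL(p | q) as alpha -> 1
print("KL:", np.sum(p * np.log(p / q)))
```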